Detailed Description
The embodiment of the invention provides an audio processing method and system for reconstructing a voice spectrum. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For easy understanding, the following describes a specific flow of an embodiment of the present invention, referring to fig. 1, and an embodiment of an audio processing method for speech spectrum reconstruction in an embodiment of the present invention includes:
Step S1, preprocessing an original voice signal acquired by a vehicle-mounted microphone to obtain a voice signal time-frequency matrix;
It will be appreciated that the execution subject of the present invention may be an audio processing system for speech spectrum reconstruction, and may also be a terminal or a server, which is not limited herein. The embodiment of the invention is described by taking a server as the execution subject as an example.
Specifically, the original voice signal collected by the vehicle-mounted microphone exists in the form of an analog signal. Analog-to-digital conversion is carried out on the original voice signal, converting the analog signal into digital form to obtain voice signal data. Direct current component elimination is then performed on the voice signal data: by subtracting the mean value of the signal, the signal is converted into a zero-mean voice signal, eliminating the influence of direct current offset on subsequent processing. The zero-mean voice signal is intercepted in segments to obtain overlapping voice signal fragments. During segmentation, overlapping interception ensures the continuity of frequency information: a certain overlapping part is arranged between adjacent signal segments, which guarantees smooth transitions between segments and improves the time-frequency resolution. The segmented voice signal fragments are weighted by applying a preset window function. The Hamming window function is selected as the window function; by multiplying the signal segments point by point with the weighting coefficients of the Hamming window, the boundaries of the signal are smoothed, reducing the spectral leakage caused by signal truncation in spectrum analysis. The Hamming window has good spectrum smoothing performance and balances the main-lobe width and side-lobe suppression in the frequency domain. A fast Fourier transform is then applied to the weighted voice signal segments. The fast Fourier transform is an efficient implementation of the discrete Fourier transform, which converts a time-domain signal into a frequency-domain signal and reveals the frequency characteristics of the voice signal.
After the fast fourier transform, the signal is represented as a complex spectrum containing amplitude information and phase information, where the amplitude part reflects the energy distribution of the signal at each frequency and the phase part describes the relative phase of the frequency components. The complex spectrum is processed using a short-time fourier transform. The short-time fourier transform obtains a time-varying amplitude spectrum and a time-varying phase spectrum of the speech signal by moving a window function stepwise on a time axis and fourier transforming the signal within each window. The time-varying amplitude spectrum reflects the distribution of the energy of the signal over time, while the time-varying phase spectrum records the phase change of the frequency components. And re-synthesizing the time-varying amplitude spectrum and the time-varying phase spectrum in a complex domain to obtain complex frequency spectrum characteristics. And filling the complex frequency spectrum characteristics of the voice signals into a two-dimensional matrix according to a time sequence to obtain a voice signal time-frequency matrix, wherein the matrix consists of N rows and M columns, N represents a frequency dimension, and M represents a time dimension.
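The preprocessing chain above (DC removal, overlapping segmentation, Hamming windowing, FFT, column-wise matrix filling) can be sketched in Python with NumPy. The function name, the frame length of 512, and the hop of 256 are illustrative choices, not values fixed by the text:

```python
import numpy as np

def speech_time_frequency_matrix(x, frame_len=512, hop=256):
    """Build an N x M complex time-frequency matrix from a 1-D speech signal.

    N (rows) is the frequency dimension, M (columns) the time dimension,
    matching the matrix layout described in step S1.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the DC component (zero-mean signal)
    window = np.hamming(frame_len)        # Hamming window to limit spectral leakage
    n_frames = 1 + (len(x) - frame_len) // hop
    columns = []
    for i in range(n_frames):             # overlapping segmentation (hop < frame_len)
        seg = x[i * hop : i * hop + frame_len] * window
        columns.append(np.fft.rfft(seg))  # complex spectrum: amplitude and phase
    return np.stack(columns, axis=1)      # shape (frame_len // 2 + 1, n_frames)

# Example: a 1 kHz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
tf = speech_time_frequency_matrix(np.sin(2 * np.pi * 1000 * t))
print(tf.shape)  # (257, 61) for frame_len=512, hop=256
```

Each column of `tf` is the complex spectrum of one overlapping segment, so the magnitude and angle of its entries give the time-varying amplitude and phase spectra directly.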
S2, performing singular value decomposition on a voice signal time-frequency matrix to obtain a sparse matrix representing voice components and a low-rank matrix representing noise components, and establishing a frequency band characteristic association matrix through a mutual information entropy calculation method;
Specifically, a singular value decomposition operation is performed on the time-frequency matrix of the voice signal, adopting the U-S-V decomposition form T = U·S·Vᵀ, wherein U and V are orthogonal matrices and S is a diagonal matrix containing all the singular values. This decomposition expresses the time-frequency matrix of the voice signal as the product of three matrices, thereby separating out information characterizing the different characteristics of the signal. The singular value sequence is sorted by magnitude to distinguish the voice components from the noise components: the singular values with larger magnitude mainly correspond to the salient features of the voice signal, and the singular values with smaller magnitude mainly reflect the noise or background components. The singular value sequence is divided based on a preset threshold value to obtain a voice singular value sequence and a noise singular value sequence respectively. A new diagonal matrix is constructed based on the voice singular value sequence, and a three-matrix multiplication operation is performed on this diagonal matrix with the U matrix and the V matrix in the decomposition matrix group to generate a sparse matrix representing the voice components. The salient features of the voice signal are preserved in the sparse matrix, and a higher concentration of voice components is achieved by reducing noise interference. Meanwhile, a diagonal matrix is constructed based on the noise singular value sequence, and a three-matrix multiplication operation is performed on it with the U matrix and the V matrix in the decomposition matrix group, obtaining a low-rank matrix representing the noise components. The low-rank matrix mainly reflects the smooth distribution characteristic of the background noise and is a complement to the voice sparse matrix.
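As a minimal sketch of this separation, the SVD split can be written as follows; since the text only says "a preset threshold", the energy-based rule for choosing the cut-off index is an assumption for illustration:

```python
import numpy as np

def svd_speech_noise_split(T, k=None, energy=0.9):
    """Split a time-frequency matrix into a speech-dominant part (large singular
    values) and a noise-dominant part (small singular values) via SVD.

    The rule "keep the smallest k singular values carrying `energy` of the total
    energy" stands in for the patent's unspecified preset threshold."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    if k is None:
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), energy)) + 1
    speech = (U[:, :k] * s[:k]) @ Vt[:k, :]   # large singular values -> speech
    noise = (U[:, k:] * s[k:]) @ Vt[k:, :]    # small singular values -> noise/background
    return speech, noise

A = np.random.default_rng(0).standard_normal((64, 40))
S, N = svd_speech_noise_split(A)
print(np.allclose(S + N, A))  # True: the two parts reconstruct the original
```

Because the two parts use complementary singular values with the same U and V, they sum exactly back to the original matrix, matching the statement that the low-rank matrix complements the voice sparse matrix.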
And carrying out sub-band division on the frequency band characteristics of the voice sparse matrix according to the Mel frequency scale. The mel frequency scale is a nonlinear frequency scale which accords with human auditory perception, and the division mode can better reflect the perception characteristics of the voice signal on different frequency bands. The sparse matrix is decomposed into a sequence of sub-matrices of K frequency bands, each sub-matrix corresponding to a speech feature on one frequency band. Pairing the sub-matrix sequences of the K frequency bands pairwise, and calculating the mutual information entropy value of each pair of sub-matrices. The mutual information entropy is an index for measuring the statistical correlation between two variables and is used for quantifying the degree of correlation between different frequency band sub-matrices, so as to obtain a mutual information entropy matrix. And carrying out symmetry inspection and normalization processing on the mutual information entropy matrix. The symmetry check can verify whether the mutual information entropy matrix meets the symmetry requirement of frequency band association, and the normalization process is to map matrix element values into a standardized range, usually between 0 and 1, so that the association degree between different frequency bands is more visual. And constructing a frequency band characteristic association matrix based on the frequency band correlation data. The element values of the matrix directly characterize the degree of association between different frequency bands, e.g. a larger value indicates a stronger association between two frequency bands.
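A histogram-based estimate of the mutual information between band sub-matrices, followed by the symmetry check and normalization described above, might look like this; the bin count and normalization by the maximum entry are illustrative assumptions:

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram estimate of the mutual information (in nats) between two
    flattened band sub-matrices; a simple stand-in for the text's
    'mutual information entropy' computation."""
    h, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = h / h.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def band_association_matrix(subbands):
    """Pairwise mutual-information matrix over K band sub-matrices,
    symmetry-checked and normalized so element values lie in [0, 1]."""
    K = len(subbands)
    M = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            M[i, j] = mutual_information(subbands[i], subbands[j])
    assert np.allclose(M, M.T, atol=1e-9)   # symmetry check
    return M / M.max()                       # map values into [0, 1]

rng = np.random.default_rng(1)
bands = [rng.standard_normal((8, 100)) for _ in range(4)]
A = band_association_matrix(bands)
print(A.shape, A.max())  # (4, 4) 1.0
```

Larger entries of the resulting matrix indicate a stronger statistical association between the two corresponding frequency bands, as the text describes.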
S3, inputting the sparse matrix into a selective forgetting extreme learning machine to predict voice characteristics, and obtaining predicted voice characteristic data;
Specifically, the sparse matrix is segmented along the time axis; each segment represents the voice feature data in one time window, and each segment of voice features is used as a training sample to form a voice feature training sequence. This segmentation effectively captures the local temporal characteristics of the voice signal while reducing computational complexity. Connection weights between the input layer and the hidden layer of the selective forgetting extreme learning machine are randomly generated based on the voice feature training sequence. Through random generation, an input weight matrix is constructed, each column of which represents the weight distribution of one neuron in the hidden layer. The input weight matrix is multiplied with the voice feature training sequence, and a bias term is superposed at the same time, mapping the linear calculation result to the hidden layer. To introduce nonlinear characteristics, an activation operation is performed on the input values of the hidden layer using the hyperbolic tangent function (tanh), generating the hidden-layer output features. The Moore-Penrose generalized inverse of the hidden-layer output features is calculated, and the weight matrix from the hidden layer to the output layer is solved analytically. The analytic weight matrix optimizes the weight distribution of the hidden layer by minimizing the output error, thereby ensuring the accuracy of the voice feature prediction. The hidden-layer output features are then screened based on a selection mechanism, which measures the similarity of each feature by calculating the Euclidean distance between it and the central feature.
The center feature represents the feature mean or weighted center within a time window, while the higher similarity features reflect the dominant speech characteristics within the current time window. Through this step, a subset of speech features is screened out, thereby removing redundant or noisy features. After feature screening is completed, the time attenuation processing is carried out on the features at each moment in the voice feature subset through a forgetting mechanism. The forgetting mechanism calculates the attenuation coefficient according to the time interval of the characteristic, the coefficient is attenuated in an exponential form, and the characteristic that the influence of the historical characteristic on the current characteristic in the voice signal is gradually weakened can be simulated. Through the process, the history features are forgotten gradually, so that the input feature set of the model is dynamically updated to be more fit with the actual situation of the current voice signal. And performing matrix multiplication operation on the updated voice characteristics and the analysis weight matrix to generate a voice characteristic predicted value at the next moment. And sequentially splicing the prediction results at each moment to reconstruct the characteristic sequence of the whole voice signal so as to obtain the predicted voice characteristic data.
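A minimal sketch of an extreme learning machine one-step predictor with an exponential forgetting mechanism follows. The hidden-layer size, decay constant, and the way the forgetting weights enter the least-squares solution are illustrative choices, not the patent's exact procedure:

```python
import numpy as np

def elm_forecast(X, hidden=64, decay=0.9, seed=0):
    """One-step-ahead feature prediction with a random-hidden-layer network.

    X has shape (T, d): T time steps of d-dimensional speech features. Input
    weights and biases are random (extreme learning machine); the output
    weights are solved analytically via the Moore-Penrose pseudoinverse, with
    exponentially decaying sample weights playing the role of the forgetting
    mechanism (older frames influence the solution less)."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    W = rng.standard_normal((d, hidden))           # random input weights
    b = rng.standard_normal(hidden)                # random biases
    H = np.tanh(X[:-1] @ W + b)                    # hidden-layer outputs
    w = decay ** np.arange(T - 2, -1, -1)          # exponential forgetting weights
    beta = np.linalg.pinv(H * w[:, None]) @ (X[1:] * w[:, None])
    return np.tanh(X[-1] @ W + b) @ beta           # predicted next feature vector

rng = np.random.default_rng(2)
feats = rng.standard_normal((50, 8))
pred = elm_forecast(feats)
print(pred.shape)  # (8,)
```

In use, each predicted vector would be appended to the feature sequence and the procedure repeated, splicing the per-step predictions into the reconstructed feature sequence described above.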
And S4, carrying out voice probability calculation and feature interpolation on each time-frequency point by adopting a logarithmic spectrum amplitude calculation method based on the predicted voice feature data and the frequency band feature association matrix to obtain the reconstructed spectrum data.
Specifically, logarithmic amplitude calculation is performed on the predicted voice feature data, re-expressing the amplitude information of the voice signal in logarithmic form so that the data distribution accords with the perception characteristics of the human ear. The logarithmic amplitude data are normalized through a frequency response function, ensuring that the amplitude features of different frequency ranges lie on a unified numerical scale, to obtain a standardized logarithmic spectrum sequence. A frequency band energy mapping relation is constructed based on the frequency band feature association matrix. The frequency band feature association matrix describes the correlation between different frequency bands; the energy contribution of each frequency band is determined through the matrix, and the mapping relation is used to perform a band-wise weighted summation on the standardized logarithmic spectrum sequence, obtaining a frequency band energy distribution matrix. With time and frequency as dimensions, this matrix reflects the energy distribution characteristics of the voice signal at each time-frequency point. The Euclidean distance between each time-frequency point and its adjacent time-frequency points in the frequency band energy distribution matrix is calculated to measure the similarity between time-frequency points. Each distance is compared with a preset distance threshold to judge whether the time-frequency point belongs to a high-probability region of the voice signal; the voice probability calculation is performed accordingly, generating a time-frequency probability distribution matrix. Clustering analysis is then performed on the voice time-frequency points according to the time-frequency probability distribution matrix.
The clustering analysis divides the time-frequency points into a voice main area and a transition area according to the voice probability of the time-frequency points, and generates an area marking matrix according to the voice main area and the transition area. The speech body region contains significant speech signal features, while the transition region contains edge portions or low probability speech features of the speech signal. And under the support of the region marking matrix, extracting the outline of the voice main body region, and calculating the optimal voice boundary track. The voice boundary track can accurately depict a main time-frequency region of a voice signal, and provides a reference for boundary processing and interpolation calculation of a frequency spectrum. And performing cubic spline interpolation calculation on the time-frequency points of the transition region based on the voice boundary track, and generating spectrum interpolation data of the transition region. And carrying out weighted fusion on the spectrum data of the voice main body area and the spectrum interpolation data of the transition area to form a logarithmic spectrum matrix. The logarithmic spectrum matrix is subjected to an inverse logarithmic transformation to restore the original amplitude information of the signal. Meanwhile, phase reconstruction is carried out to supplement the phase part of the frequency spectrum, new phase values are generated by combining the original phase information of the time-frequency matrix or based on a consistency algorithm, and finally the reconstructed frequency spectrum data are obtained.
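The cubic spline interpolation of transition-region time-frequency points can be sketched for a single log-spectrum column as follows; the function name and the boolean body/transition mask are assumptions introduced for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_transition(log_spec_col, body_mask):
    """Fill the transition-region bins of one log-spectrum column by cubic
    spline interpolation through the speech-body bins.

    `body_mask` is True where a bin belongs to the speech body region (as
    produced by the region marking step); remaining bins get spline values."""
    k = np.arange(len(log_spec_col))
    spline = CubicSpline(k[body_mask], log_spec_col[body_mask])
    out = log_spec_col.copy()
    out[~body_mask] = spline(k[~body_mask])   # spline estimates for transition bins
    return out

col = np.log(np.abs(np.sin(np.linspace(0.1, 3.0, 32))) + 1e-3)
mask = np.ones(32, dtype=bool)
mask[10:14] = False                           # pretend bins 10-13 are the transition region
filled = interpolate_transition(col, mask)
print(filled.shape)
```

The body-region values pass through unchanged, so a subsequent weighted fusion of body spectra and interpolated transition spectra, as described above, operates on a complete column.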
And carrying out power spectrum calculation on each time-frequency point of the reconstructed spectrum data, obtaining a power spectrum by squaring the modulus value of the complex spectrum, and reflecting the energy distribution characteristic of the signal. In order to smooth out transient fluctuations in the power spectrum, it is subjected to a moving average process. The power spectrum values at a plurality of continuous moments are averaged on a time axis by the moving average, so that the influence of short-time random fluctuation is effectively reduced, and a smoothed power spectrum matrix is obtained. And calculating the power spectrum value of the background noise through frequency band partition based on the smoothed power spectrum matrix, and generating a noise power spectrum matrix by adopting a minimum tracking algorithm or a statistical model so as to reflect the characteristics of the background noise in each frequency band. And calculating the prior signal-to-noise ratio and the posterior signal-to-noise ratio of each frequency band according to the ratio relation between the smoothed power spectrum matrix and the noise power spectrum matrix. The prior signal-to-noise ratio estimates the predicted power of the voice signal through the historical characteristics, and the posterior signal-to-noise ratio is directly calculated based on the ratio of the power spectrum value to the noise power spectrum value at the current moment. Combining the two to form a signal to noise ratio parameter matrix. The signal to noise ratio parameter matrix is weighted through the transfer function of the wiener filter, and the wiener filter can optimize the frequency response of the filter according to the signal to noise ratio to generate the self-adaptive filter coefficient matrix. The matrix adaptively adjusts the gain of the signal in the frequency domain, effectively suppressing the noise component. 
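One common way to combine the a-priori and a-posteriori signal-to-noise ratios into a Wiener transfer function is the decision-directed estimate sketched below; the text only specifies that the two ratios are combined, so this particular recursion and the smoothing constant are illustrative:

```python
import numpy as np

def wiener_gain(power, noise_power, alpha=0.98):
    """Per-bin Wiener gains from smoothed signal and noise power spectra.

    gamma is the a-posteriori SNR (current power over noise power); xi is a
    decision-directed a-priori SNR estimate that leans on the previous frame's
    gain; the returned matrix xi / (1 + xi) is the Wiener transfer function."""
    gamma = np.maximum(power / noise_power, 1e-10)   # a-posteriori SNR
    xi = np.empty_like(gamma)
    xi[:, 0] = np.maximum(gamma[:, 0] - 1.0, 0.0)
    for m in range(1, gamma.shape[1]):               # recurse over time frames
        g_prev = xi[:, m - 1] / (1.0 + xi[:, m - 1])
        xi[:, m] = alpha * g_prev**2 * gamma[:, m - 1] \
                   + (1 - alpha) * np.maximum(gamma[:, m] - 1.0, 0.0)
    return xi / (1.0 + xi)                           # adaptive filter coefficients

rng = np.random.default_rng(3)
P = rng.uniform(0.5, 4.0, (16, 10))                  # smoothed power spectrum matrix
G = wiener_gain(P, np.ones_like(P))                  # unit noise power for the example
print(G.shape)  # (16, 10)
```

The gains lie in [0, 1), so multiplying them element-wise into the reconstructed spectrum attenuates low-SNR bins while passing high-SNR bins nearly unchanged.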
And performing complex matrix multiplication operation by using the adaptive filter coefficient matrix and the reconstructed spectrum data to obtain a filtered voice spectrum. Dividing the filtered voice frequency spectrum according to sub-bands, analyzing the energy characteristic of each sub-band, calculating the energy mean value and variance of each sub-band, reflecting the dynamic change characteristic of the frequency spectrum signal, and further being used for generating dynamic gain control parameters. The dynamic gain control parameter performs nonlinear mapping operation and signal amplitude compensation according to the energy characteristics of each sub-band. The nonlinear mapping employs a compression or expansion operation to ensure that the amplitude characteristics of the signal more closely match the auditory perception of the human ear. And (3) recovering the original dynamic characteristics of the adjusted voice frequency spectrum through signal amplitude compensation to obtain the voice frequency spectrum after gain compensation. And restoring the voice frequency spectrum after gain compensation to a time domain signal through inverse short time Fourier transform. In the inverse transformation process, windowing and overlap-add operations are performed simultaneously to ensure smooth transitions and continuity of the signal on the time axis, resulting in an enhanced speech signal. Mel-frequency cepstral coefficient (MFCC) features are extracted for enhanced speech signals. MFCC features are a feature representation method widely used in speech recognition, which maps spectral characteristics of a speech signal to a nonlinear frequency space of human auditory perception based on mel frequency scale, and extracts feature parameters containing speech information through cepstrum calculation. And inputting the extracted MFCC features into a deep neural network model for voice recognition. 
The deep neural network can learn the semantic information of the signal from the complex voice characteristics by utilizing the characterization capability of the deep neural network, and a voice recognition result is obtained.
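The inverse short-time Fourier transform with synthesis windowing and overlap-add mentioned above can be sketched as follows; the 512/256 frame geometry and the weighted-overlap-add normalization are illustrative choices consistent with the analysis in step S1:

```python
import numpy as np

def overlap_add_istft(tf, frame_len=512, hop=256):
    """Inverse FFT per column, synthesis windowing, and overlap-add, with a
    normalization buffer so the overlapping windows divide back out. Assumes
    the analysis used a Hamming window of the same length."""
    window = np.hamming(frame_len)
    n_frames = tf.shape[1]
    out = np.zeros((n_frames - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        seg = np.fft.irfft(tf[:, i], n=frame_len)
        out[i * hop : i * hop + frame_len] += seg * window
        norm[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-12)

# Round trip: frame and transform a test tone, then resynthesize it
fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
win = np.hamming(512)
tf = np.stack([np.fft.rfft(x[i * 256 : i * 256 + 512] * win) for i in range(61)],
              axis=1)
y = overlap_add_istft(tf)
print(np.allclose(y, x[:len(y)]))  # True
```

Dividing by the accumulated squared window is what guarantees the smooth transitions and continuity on the time axis that the text attributes to the windowed overlap-add.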
The enhanced speech signal is preprocessed: the signal is framed and then subjected to a 512-point fast Fourier transform, converting the time-domain signal into the frequency domain and revealing the frequency components of the signal. After the spectrum is obtained, it is filtered by a bank of 26 triangular mel filters, and the energy distribution of each frequency band is calculated to generate a mel band energy spectrum. The filters divide the frequency bands according to the mel frequency scale and conform to the auditory perception characteristics of the human ear, so that the frequency distribution of the voice signal has more auditory relevance. The logarithm of the mel band energy spectrum is computed to compress its dynamic range while making the data more consistent with the energy perception characteristics of the speech signal. The log energy spectrum is mapped from the frequency domain to the cepstral domain by a discrete cosine transform, thereby removing the correlation between frequency bands and extracting representative mel-frequency cepstral coefficients. The first 13 cepstral coefficients are selected as static features to capture the main spectral characteristics of the speech signal. To enhance the temporal dynamic information of the features, first-order and second-order difference features are calculated on the static features, representing the change speed and acceleration of the features respectively, to form a complete mel-frequency cepstral coefficient feature set. The extracted mel-frequency cepstral coefficient features are input into the bidirectional long short-term memory network (BiLSTM) layer of the deep neural network model for time-sequence modeling. The bidirectional LSTM comprises 256 memory cells, which capture context information over a long time range in the speech signal by modeling the feature sequence in both the forward and backward directions simultaneously.
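The MFCC extraction described above (power spectrum from a 512-point FFT, 26 triangular mel filters, log compression, DCT, first 13 coefficients plus first- and second-order differences) can be sketched as follows. The filterbank construction is a standard textbook form, and `np.gradient` stands in for the usual regression-based delta computation:

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_power(power_spec, fs=16000, n_fft=512, n_mels=26, n_ceps=13):
    """39-dimensional MFCC feature set from a (n_fft//2+1, M) power spectrogram."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):                      # triangular mel filters
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    band_energy = fb @ power_spec                # mel band energy spectrum
    log_energy = np.log(band_energy + 1e-10)     # compress the dynamic range
    ceps = dct(log_energy, type=2, axis=0, norm='ortho')[:n_ceps]
    delta = np.gradient(ceps, axis=1)            # first-order (change speed)
    delta2 = np.gradient(delta, axis=1)          # second-order (acceleration)
    return np.vstack([ceps, delta, delta2])      # static + dynamic features

spec = np.abs(np.random.default_rng(4).standard_normal((257, 30))) ** 2
feat = mfcc_from_power(spec)
print(feat.shape)  # (39, 30)
```

Each column of the result is the 39-dimensional feature vector for one frame, ready to feed the BiLSTM layer.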
By bi-directional modeling, the generated context coding features can simultaneously consider the features of the current speech frame and the dependency relationship of the previous and subsequent frames. The context-encoding features are input into the self-attention layer of the deep neural network model, the design of which contains 8 attention heads, each head having a dimension of 32. Through the dot product attention mechanism, the self-attention layer dynamically assigns attention weights, capturing correlations between different time steps in a feature sequence. The attention mechanism generates weighted contextual features by weighting the features so that the model can more effectively focus on the semantically related key information in the speech signal. The weighted context features are input into a 3-layer feedforward neural network of the deep neural network to perform high-layer feature extraction. Each layer of feed forward neural network contains 512 neurons and uses a ReLU activation function to introduce nonlinear expression capability. In order to alleviate the problem of gradient disappearance and improve the training efficiency of the network, residual connection is adopted between each layer of feedforward neural network, so that the model can better capture complex voice characteristics. And generating high-level semantic features through layer-by-layer processing of the multi-layer feedforward neural network. And (3) performing sequence labeling on the high-level semantic features, calculating state transition probability based on the dependency relationship between the context information and the feature sequence, and generating a phoneme recognition sequence. The phoneme recognition sequence represents the smallest linguistic unit of the speech signal, but has not yet had complete semantic information. 
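The self-attention computation of this step, with 8 heads of dimension 32 over 256-dimensional context features, can be illustrated in NumPy. The projection weights are random here purely to show the shapes and the scaled dot-product mechanism; a trained model learns them:

```python
import numpy as np

def multi_head_self_attention(X, n_heads=8, head_dim=32, seed=5):
    """Scaled dot-product self-attention over a (T, d) feature sequence,
    returning the concatenated (T, n_heads * head_dim) weighted context."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    outs = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, head_dim)) / np.sqrt(d)
                      for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(head_dim)            # pairwise time-step similarities
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)         # softmax: each row sums to 1
        outs.append(attn @ V)                            # attention-weighted context
    return np.concatenate(outs, axis=1)

ctx = np.random.default_rng(6).standard_normal((20, 256))  # e.g. BiLSTM outputs
att = multi_head_self_attention(ctx)
print(att.shape)  # (20, 256)
```

The softmax rows are the dynamically assigned attention weights, so each output frame is a weighted combination of all time steps, capturing the cross-step correlations described above.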
In order to improve the accuracy of the recognition result, the phoneme recognition sequence is input into a language model for decoding processing. The decoding process applies a language-level constraint to the phoneme sequence by using a pre-constructed dictionary and grammar rules, corrects recognition errors therein, and finally generates a speech recognition result.
According to the embodiment of the invention, the voice features are predicted by introducing a selective forgetting extreme learning machine; combining a selection mechanism with a forgetting mechanism effectively captures the time-sequence dependency of the voice signal and improves the prediction precision of the voice features. The voice signal is decomposed into a sparse matrix and a low-rank matrix by a matrix decomposition method based on singular value decomposition, and a frequency band feature association matrix is established through mutual information entropy calculation, realizing the effective separation of voice components and noise components. Time-frequency points are accurately reconstructed by combining logarithmic spectrum amplitude calculation with feature interpolation, effectively maintaining the continuity and naturalness of the voice signal. An adaptive wiener filtering and dynamic gain control strategy is designed, in which the filter parameters and gain coefficients are dynamically adjusted according to the signal-to-noise ratio, significantly improving the voice enhancement effect. Voice recognition is carried out by a deep neural network model; the bidirectional long short-term memory network and the self-attention mechanism effectively extract the long- and short-term dependency of the voice features, improving the accuracy of voice recognition and meeting the real-time processing requirement in the vehicle-mounted environment.
In a specific embodiment, the process of executing step S1 may specifically include the following steps:
performing analog-to-digital conversion on an original voice signal acquired by a vehicle-mounted microphone to obtain voice signal data, and performing direct current component elimination on the voice signal data to obtain a zero-mean voice signal;
sectionally intercepting the zero-mean voice signal to obtain overlapped voice signal fragments, inputting the overlapped voice signal fragments into a preset Hamming window function for weighting calculation to obtain windowed voice signal data;
Performing fast Fourier transform on the windowed voice signal data to obtain a complex frequency spectrum containing amplitude information and phase information;
Performing time-frequency analysis on the complex frequency spectrum through short-time Fourier transform to obtain a time-varying amplitude spectrum and a time-varying phase spectrum of the voice signal, and performing complex domain synthesis on the time-varying amplitude spectrum and the time-varying phase spectrum of the voice signal to obtain time-frequency characteristic data of the voice signal;
and constructing a matrix structure of N rows and M columns according to the time-frequency characteristic data of the voice signals, and filling data according to a time sequence relationship to obtain a time-frequency matrix of the voice signals, wherein N represents a frequency dimension and M represents a time dimension.
Specifically, analog-to-digital conversion is performed on the collected analog voice signal. The analog signal is denoted x_a(t), wherein t is a continuous time variable and x_a(t) is the amplitude of the signal at time t. Through a sampling operation, it is converted into a discrete-time signal x[n], wherein n is a discrete time index representing the n-th sampling point and x[n] is the amplitude of the discrete signal at index n. The sampling frequency f_s is chosen to be more than twice the maximum frequency of the speech signal, so as to satisfy the Nyquist sampling theorem. The output of the analog-to-digital conversion is discrete speech signal data. After the discrete signal x[n] is obtained, in order to eliminate the direct current component in the signal, the mean value μ of the signal is calculated, defined as:
μ = (1/N) Σ_{n=0}^{N−1} x[n];
wherein N is the total number of sampling points of the signal and x[n] is the amplitude of the signal at the n-th sampling point. By subtracting the mean value from the signal, a zero-mean signal x₀[n] = x[n] − μ is obtained. The zero-mean processing effectively removes the direct current offset in the signal and ensures that subsequent processing concentrates on the dynamic component of the signal. The zero-mean signal is then segmented to meet the requirement of short-time analysis. The segmentation is realized by a sliding window of fixed length L, with a sliding step R. The result of the segmentation is a sequence of overlapping segments, each denoted x_i[m], wherein i is the segment index representing the i-th segment and m is the sample index within a segment, defined as:
x_i[m] = x₀[iR + m], m = 0, 1, …, L − 1;
wherein x₀ is the zero-mean signal, iR is the starting position of the i-th segment, and L is the window length, which represents the number of sample points each segment contains. The purpose of the overlapping segments is to ensure the temporal continuity of the signal and avoid information loss. Each segment x_i[m] is weighted by applying a preset Hamming window function. The Hamming window function w[m] is defined as:
w[m] = 0.54 − 0.46 cos(2πm/(L − 1)), m = 0, 1, …, L − 1;
wherein m is the sample point index representing the position of the currently calculated point in the window and L is the window length, which controls the width of the window function. The main function of the Hamming window is to reduce the spectral leakage of the signal. The windowed signal x̃_i[m] is expressed as:
x̃_i[m] = x_i[m] · w[m];
wherein x_i[m] is the amplitude of the segmented signal and w[m] is the Hamming window coefficient, in the range [0, 1]. Through the weighting, the boundaries of the signal segments become smooth, thereby improving the accuracy of subsequent spectral analysis. A fast Fourier transform is performed on the windowed signal x̃_i[m] to convert the time-domain signal into the frequency domain, yielding the complex spectrum X_i[k], calculated as:
X_i[k] = Σ_{m=0}^{L−1} x̃_i[m] e^{−j2πkm/L};
wherein X_i[k] is the complex spectrum, representing the spectral value of the i-th time segment at frequency index k; x̃_i[m] is the windowed signal; k is the frequency index, in the range 0 ≤ k ≤ L − 1; L is the window length, representing the number of samples per segment of signal; and j is the imaginary unit, satisfying j² = −1. The complex spectrum X_i[k] is expressed in polar form as X_i[k] = A_i[k] e^{jφ_i[k]}, wherein A_i[k] is the amplitude spectrum, representing the amount of energy at frequency index k, and φ_i[k] is the phase spectrum, representing the phase information at frequency index k. The complex spectrum is processed by the short-time Fourier transform into a time-varying amplitude spectrum A(t, f) and a time-varying phase spectrum φ(t, f). The result of the short-time Fourier transform is a time-varying set of spectra that accurately captures the characteristics of the speech signal in both the time and frequency domains. In order to reconstruct the complex-domain features of the signal, the time-varying amplitude spectrum and the time-varying phase spectrum are re-synthesized as:
X(t, f) = A(t, f) e^{jφ(t, f)};
thereby generating the time-frequency feature data of the signal. A matrix T of N rows and M columns is constructed from the time-frequency feature data, wherein N represents the frequency dimension, equal to the number of frequency points of the fast Fourier transform, and M represents the time dimension, determined by the total length of the signal and the segmentation step. Each column of the time-frequency matrix corresponds to a time segment and each row corresponds to a frequency component; the element T[k, i] of the matrix is defined as:
T[k, i] = A_i[k] e^{jφ_i[k]};
wherein T[k, i] is an element of the time-frequency matrix, representing the complex spectral value at time segment i and frequency point k; A_i[k] is the amplitude value of the k-th frequency point of the i-th time segment; and φ_i[k] is the phase value of the k-th frequency point of the i-th time segment. By filling each column of the matrix in time order, the complete speech signal time-frequency matrix is generated.
In a specific embodiment, the process of executing step S2 may specifically include the following steps:
performing U-S-V singular value decomposition operation on the voice signal time-frequency matrix to obtain a decomposition matrix group containing voice and noise mixing characteristics;
Sorting singular values in the decomposition matrix group according to the magnitude, and carrying out separation treatment on voice components and noise components based on a preset threshold value to obtain a voice singular value sequence and a noise singular value sequence;
Constructing a diagonal matrix based on the voice singular value sequence, and performing three-matrix multiplication operation on the diagonal matrix and a U matrix and a V matrix in a decomposition matrix group to obtain a sparse matrix representing voice components;
constructing a diagonal matrix based on the noise singular value sequence, and performing three-matrix multiplication operation on the diagonal matrix and a U matrix and a V matrix in a decomposition matrix group to obtain a low-rank matrix representing noise components;
Sub-band division is carried out on the sparse matrix according to the Mel frequency scale to obtain a sub-matrix sequence of K frequency bands, the sub-matrix sequences of the K frequency bands are paired in pairs, and mutual information entropy values are calculated for each pair of sub-matrices to obtain a mutual information entropy matrix;
And carrying out symmetry inspection and normalization processing on the mutual information entropy matrix to obtain frequency band correlation data, constructing a frequency band characteristic association matrix based on the frequency band correlation data, and representing the association degree among different frequency bands by matrix element values of the frequency band characteristic association matrix.
In particular, for the time-frequency matrix S of the speech signal, singular value decomposition is applied to decompose it into the product of three matrices, and the expression is:
S = U·Σ·V^T;
Wherein S represents the time-frequency matrix of the speech signal, the dimension of which is F×T, wherein F is the frequency dimension, represents the number of rows of the matrix and corresponds to the number of frequency components of the speech signal, and T is the time dimension, represents the number of columns of the matrix and corresponds to the number of time frames of the speech signal; U is the left singular matrix, with dimensions F×F, each column of which is a left singular vector of S, used for characterizing the feature information of the frequency domain; Σ is a diagonal matrix with dimensions F×T, whose diagonal elements are the singular values σ_1 ≥ σ_2 ≥ … ≥ σ_r, ordered according to magnitude and used for measuring the importance of the corresponding singular vectors; V is the right singular matrix, with dimensions T×T, each column of which is a right singular vector of S, used for characterizing the feature information of the time domain. Through singular value decomposition, the time-frequency matrix S is decomposed into a matrix set characterizing signal features, wherein the singular values σ_i are the diagonal elements of the diagonal matrix Σ and reflect the importance of each component in the signal. According to a preset threshold δ, the singular values are classified into a speech singular value sequence and a noise singular value sequence. The speech singular value sequence comprises the singular values with amplitude greater than δ, and the noise singular value sequence comprises the singular values with amplitude less than or equal to δ. The mathematical definition is as follows:
Σ_s(i,i) = σ_i if σ_i > δ, otherwise 0;  Σ_n(i,i) = σ_i if σ_i ≤ δ, otherwise 0;
Wherein Σ_s and Σ_n are diagonal matrices of the singular values of speech and noise respectively, and i is the row-column index of the matrix, representing where the singular values are located. Based on the speech singular value sequence, the speech sparse matrix S_s is constructed, and the calculation formula is as follows:
S_s = U·Σ_s·V^T;
Wherein S_s is the speech sparse matrix, characterizing the main features of the speech components; U is the left singular matrix obtained by the decomposition; Σ_s is the diagonal matrix of speech singular values; V^T is the transpose of the right singular matrix. Similarly, based on the noise singular value sequence, the noise low-rank matrix S_n is constructed, and the formula is as follows:
S_n = U·Σ_n·V^T;
Wherein S_n is the noise low-rank matrix, used to characterize the noise components, and Σ_n is the diagonal matrix of noise singular values. Sub-band division is performed on the speech sparse matrix S_s according to the mel frequency scale, wherein the mel frequency scale is a nonlinear frequency division mode, and the calculation formula is as follows:
f_mel = 2595·log₁₀(1 + f/700);
Wherein f is a linear frequency, representing a frequency component of the signal, and f_mel is the corresponding mel frequency, used for reflecting the perception of frequency by the human ear. According to the division result, S_s is decomposed into K sub-matrices S_s^(1), …, S_s^(K), each sub-matrix corresponding to one frequency band. The sub-matrices are then paired pairwise, and the mutual information entropy value I(i,j) is calculated for each pair, with the formula:
I(i,j) = Σ_{x,y} p(x,y)·log( p(x,y) / (p(x)·p(y)) );
Wherein I(i,j) is the mutual information entropy value of the sub-matrices S_s^(i) and S_s^(j); p(x,y) is their joint probability distribution; p(x) and p(y) are respectively the marginal probability distributions of S_s^(i) and S_s^(j), used for quantifying the information correlation between the frequency bands. After calculating the mutual information entropy values of all pairs, the mutual information entropy matrix M is constructed, wherein M(i,j) = I(i,j). To ensure matrix symmetry and normalization, M undergoes a symmetry check and normalization processing, wherein the normalization formula is:
M'(i,j) = M(i,j) / M_max;
Wherein M_max is the largest element value in the mutual information entropy matrix. Based on the normalized mutual information entropy matrix M', the frequency band feature association matrix R is constructed, whose elements R(i,j) characterize the degree of association between frequency band i and frequency band j. Through this process, separation of the speech and noise components is achieved, and correlation information between the frequency characteristics of the speech signal is extracted.
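The threshold split of singular values described above can be sketched as follows; svd_separate and threshold are illustrative names, and the mel sub-band pairing and mutual-information steps are omitted for brevity:

```python
import numpy as np

def svd_separate(S, threshold):
    """Sketch of the SVD-based separation above: singular values above the
    preset threshold (delta) form the speech sparse part, the remaining
    ones form the noise low-rank part."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    speech_sv = np.where(sigma > threshold, sigma, 0.0)   # Sigma_s diagonal
    noise_sv = np.where(sigma <= threshold, sigma, 0.0)   # Sigma_n diagonal
    S_speech = U @ np.diag(speech_sv) @ Vt                # U * Sigma_s * V^T
    S_noise = U @ np.diag(noise_sv) @ Vt                  # U * Sigma_n * V^T
    return S_speech, S_noise
```

Because Σ_s + Σ_n = Σ, the two parts sum back to the original matrix exactly, which is a quick sanity check on the split.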
In a specific embodiment, the process of executing step S3 may specifically include the following steps:
segmenting the sparse matrix according to a time axis, and taking each segment of voice characteristics as a training sample to obtain a voice characteristic training sequence;
Randomly generating connection weights between an input layer and a hidden layer of the selective forgetting extreme learning machine based on the voice characteristic training sequence to obtain an input weight matrix;
multiplying the input weight matrix with the voice feature training sequence, superposing the bias items, and performing activation operation through a hyperbolic tangent function to obtain hidden layer output features;
performing Moore-Penrose generalized inverse matrix calculation on the hidden layer output characteristics to obtain an analysis weight matrix from the hidden layer to the output layer;
Screening the hidden layer output features based on a selection mechanism, and calculating the similarity between each feature and the central feature through Euclidean distance to obtain a voice feature subset;
Calculating time attenuation coefficients of the features at each moment in the voice feature subset based on a forgetting mechanism, and forgetting the historical features to obtain updated voice features;
Multiplying the updated voice feature with the analysis weight matrix to obtain a voice feature predicted value at the next moment, and sequentially splicing the predicted voice feature predicted values to obtain predicted voice feature data.
In particular, the sparse matrix S_s is segmented along the time axis, dividing the matrix into a number of small segments, each segment representing the speech features within one time range. The sparse matrix S_s has dimensions F×T, wherein F represents the frequency dimension and T represents the time dimension. By selecting a fixed segment length L and an overlapping step size d, each segment of features is denoted S_s^(p), wherein p is the segment index and S_s^(p) is the feature sub-matrix of the p-th segment. After segmentation, each segment of features is flattened into a one-dimensional vector to form the speech feature training sample sequence, recorded as {x_1, x_2, …, x_Q}, wherein x_p is the flattened feature vector of the p-th segment, with dimension F·L, and Q is the total number of samples after segmentation. The connection weights between the input layer and the hidden layer of the selective forgetting extreme learning machine are randomly generated based on the speech feature training sequence. Assuming the hidden layer contains H neurons, the input weight matrix W has dimensions H×(F·L), wherein H is the number of neurons in the hidden layer and F·L is the dimension of the input vector. The elements w_ij of the input weight matrix are randomly generated according to a uniform distribution or a Gaussian distribution, with the formula:
w_ij ~ U(−a, a) or w_ij ~ N(μ, σ²);
Wherein w_ij is the connection weight between the i-th hidden layer neuron and the j-th input feature, [−a, a] is the range of the uniform distribution, and μ and σ² are respectively the mean and variance of the Gaussian distribution. The input weight matrix W is multiplied with the speech feature training samples X, a bias vector b is superimposed, and nonlinear mapping is then performed through the hyperbolic tangent activation function (tanh) to obtain the hidden layer output features. The formula is:
H_out = tanh(W·X^T + b);
Wherein H_out is the hidden layer output feature matrix, with dimensions H×Q; X^T is the transpose of the input sample matrix; b is the bias vector, with dimensions H×1, each element b_i being the bias of the i-th neuron of the hidden layer. The Moore-Penrose generalized inverse matrix calculation is performed on the hidden layer output features to obtain the analysis weight matrix β from the hidden layer to the output layer. The generalized inverse calculation formula is:
β = (H_out^T)^†·Y;
Wherein (H_out^T)^† is the Moore-Penrose generalized inverse of H_out^T, and Y is the label matrix of the target output, representing the real feature values of the training samples. The generalized inverse matrix is used for calculating an analytic solution through the least squares method and optimizing the output weights of the model. Based on the hidden layer output features H_out, they are screened by a selection mechanism. The selection mechanism calculates the Euclidean distance between the hidden layer output features and the central feature, which is used for measuring similarity. Let the central feature vector be h_c; then the Euclidean distance between the i-th feature and the central feature is:
d_i = ‖h_i − h_c‖₂;
Wherein d_i is the distance value, h_i is the i-th hidden layer output feature vector, and ‖·‖₂ represents the 2-norm. Features with distance less than a threshold θ are selected to constitute the speech feature subset H_sel, for subsequent forgetting mechanism processing. The forgetting mechanism calculates the time attenuation coefficient λ_t of each feature and weights the historical features. The formula of the attenuation coefficient is:
λ_t = e^{−Δt/τ};
Wherein Δt is the time interval between the feature and the current time, and τ is a time constant controlling the decay rate. The updated speech features are:
H' = H_sel ⊙ Λ;
Wherein H' is the updated feature matrix, ⊙ represents element-wise multiplication of matrices, and Λ is the diagonal matrix of time attenuation coefficients. The updated speech features H' are multiplied with the analysis weight matrix β to obtain the speech feature predicted value ŷ of the next moment, with the formula:
ŷ = (H')^T·β;
and sequentially splicing the predicted values at all the moments to finally generate complete predicted voice characteristic data.
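A minimal sketch of the extreme-learning-machine core described above (random input weights, tanh hidden layer, Moore-Penrose pseudoinverse for the output weights) follows; the selection and forgetting mechanisms are omitted, and `hidden` and `seed` are illustrative choices, not values from the embodiment:

```python
import numpy as np

def elm_fit_predict(X_train, T_train, X_new, hidden=64, seed=0):
    """Extreme learning machine sketch: W and b are random and fixed,
    only the output weights beta are solved analytically via the
    Moore-Penrose generalized inverse (least squares)."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(hidden, d))   # random input weights
    b = rng.uniform(-1.0, 1.0, size=(hidden, 1))   # random hidden biases
    H = np.tanh(W @ X_train.T + b)                 # hidden-layer outputs, H x Q
    beta = np.linalg.pinv(H.T) @ T_train           # analytic output weights
    H_new = np.tanh(W @ X_new.T + b)
    return H_new.T @ beta                          # predicted features
```

With more hidden units than training samples, the pseudoinverse solution interpolates the training targets, which is the usual ELM behavior.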
In a specific embodiment, the process of performing step S4 may specifically include the following steps:
carrying out logarithmic magnitude calculation on the predicted voice characteristic data, and carrying out magnitude normalization processing based on a frequency response function to obtain a standardized logarithmic spectrum sequence;
Constructing a frequency band energy mapping relation based on the frequency band characteristic association matrix, and carrying out weighted summation operation on the standardized logarithmic spectrum sequence according to the frequency band to obtain a frequency band energy distribution matrix;
Calculating Euclidean distance between each time-frequency point and adjacent time-frequency points in the frequency band energy distribution matrix, and carrying out voice probability calculation based on a distance threshold value to obtain a time-frequency probability distribution matrix;
Performing cluster analysis on the voice time-frequency points according to the time-frequency probability distribution matrix, and dividing the time-frequency points into a voice main body region and a transition region to obtain a region marking matrix;
Performing contour extraction and optimal voice boundary calculation on a voice main body region in a region marking matrix to obtain a voice boundary track, and performing cubic spline interpolation calculation on time-frequency points of a transition region based on the voice boundary track to obtain spectrum interpolation data of the transition region;
And carrying out weighted fusion on the spectrum data of the voice main body area and the spectrum interpolation data of the transition area to obtain a logarithmic spectrum matrix, and carrying out inverse logarithmic transformation and phase reconstruction on the logarithmic spectrum matrix to obtain reconstructed spectrum data.
In particular, log-amplitude calculation is performed on the predicted speech feature data to compress the dynamic range of the data, so that the spectral data better accords with the perception characteristics of human ears. The logarithmic magnitude is calculated as:
L(f,t) = 20·log₁₀(A(f,t));
Wherein A(f,t) is the amplitude value at frequency f and time t, and L(f,t) is the log-magnitude spectrum, used to represent the spectral energy distribution on a logarithmic scale. Amplitude normalization processing is performed on L(f,t) based on the frequency response function to eliminate non-uniformities in the different frequency responses. The normalization formula is:
L'(f,t) = L(f,t) / H(f);
Wherein L'(f,t) is the normalized log-magnitude spectrum, and H(f) is the frequency response function, representing the weights of the different frequency components. Through this process, a normalized log spectrum sequence is generated whose amplitude values are adjusted to a uniform scale over all frequency ranges. Based on the frequency band feature association matrix R, the frequency band energy mapping relation is constructed for extracting the energy distribution of each frequency band. The frequency band feature association matrix R is a K×K symmetric matrix, K representing the number of frequency bands, and the matrix elements R(i,j) characterize the degree of association between frequency band i and frequency band j. A weighted summation operation is performed on the normalized log spectrum sequence according to the frequency band division, obtaining the frequency band energy distribution matrix E, and the calculation formula is as follows:
E(i,t) = Σ_{f∈B_i} L'(f,t);
Wherein E(i,t) represents the energy value of frequency band i at time t, and B_i is the set of frequency components contained in frequency band i. For the frequency band energy distribution matrix E, the Euclidean distance between each time-frequency point and its adjacent time-frequency points is calculated so as to measure the degree of change of the energy distribution. Let the band energy of the current point be E_1 and the energy of an adjacent point be E_2; the formula of the Euclidean distance D between the two points is:
D = ‖E_1 − E_2‖₂;
Wherein D is the Euclidean distance, used for representing the smoothness of the time-frequency point change. By comparing with a preset distance threshold D_th, the speech probability P is calculated, with the formula:
P(i,t) = 1 if D < D_th, otherwise 0;
Thereby generating the time-frequency probability distribution matrix P, in which a value of 1 indicates a high-probability speech region and a value of 0 indicates a non-speech region. Based on the time-frequency probability distribution matrix, cluster analysis is performed on the speech time-frequency points, and the time-frequency points are divided into a speech main-body region and a transition region. The cluster analysis adopts a connected-component method: adjacent high-probability points are classified into the speech main-body region, other points are classified into the transition region, and the region marking matrix is generated, whose element values mark the region type to which each time-frequency point belongs. In the region marking matrix, the main boundary of the speech is recognized by extracting the contour of the speech main-body region, and the optimal speech boundary trajectory is calculated by a dynamic programming method. The formula of the speech boundary trajectory is:
b(t) = arg min_b Σ_t C(b, t);
Wherein b(t) is the frequency band index corresponding to time t, used for describing the boundary of the speech, and C(b,t) is the boundary cost accumulated by the dynamic programming. Based on the speech boundary trajectory b(t), cubic spline interpolation calculation is performed on the time-frequency points of the transition region to generate the spectrum interpolation data of the transition region. The cubic spline interpolation formula is:
S_i(f) = a_i + b_i·(f − f_i) + c_i·(f − f_i)² + d_i·(f − f_i)³;
Wherein a_i, b_i, c_i, d_i are the coefficients of the spline interpolation, calculated from the known boundary trajectory points, and f represents the frequency index. The spectral data L_main of the speech main-body region and the spectrum interpolation data L_interp of the transition region are weighted and fused to obtain the complete logarithmic spectrum matrix. The fusion formula is:
L_fused = α·L_main + (1 − α)·L_interp;
Wherein α is a weighting coefficient used to balance the contributions of the two parts of data. The logarithmic spectrum matrix L_fused undergoes an inverse logarithmic transformation to recover the amplitude values, while phase reconstruction is carried out in combination with the original phase data φ(f,t), finally obtaining the reconstructed spectrum data, with the formula:
X_rec(f,t) = 10^{L_fused(f,t)/20}·e^{jφ(f,t)};
Wherein φ(f,t) represents the reconstructed phase information, and 10^{L_fused(f,t)/20} is the inverse logarithmic transformation of the amplitude values.
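The fusion and reconstruction step can be sketched as follows, assuming the 20·log₁₀ scale for the log magnitudes; fuse_and_reconstruct and alpha are illustrative names, and the boundary extraction and spline interpolation that produce the two log-spectrum inputs are not reproduced:

```python
import numpy as np

def fuse_and_reconstruct(L_body, L_interp, phase, alpha=0.7):
    """Sketch of the final S4 step: weighted fusion of the speech-body log
    spectrum with the transition-region interpolation, inverse log
    transform back to amplitude, then phase reconstruction."""
    L_fused = alpha * L_body + (1.0 - alpha) * L_interp   # weighted fusion
    amplitude = 10.0 ** (L_fused / 20.0)                  # inverse of 20*log10
    return amplitude * np.exp(1j * phase)                 # reattach phase
```

Feeding the same log spectrum into both inputs must return its plain amplitude with the given phase, which is an easy consistency check.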
In a specific embodiment, the audio processing method for performing speech spectrum reconstruction further includes the steps of:
Carrying out power spectrum calculation and moving average processing on each time-frequency point of reconstructed spectrum data to obtain a smoothed power spectrum matrix, and calculating background noise power spectrums of each frequency band based on the smoothed power spectrum matrix to obtain a noise power spectrum matrix;
Calculating a priori signal-to-noise ratio and a posterior signal-to-noise ratio for each frequency band according to the ratio relation between the smoothed power spectrum matrix and the noise power spectrum matrix to obtain a signal-to-noise ratio parameter matrix, and carrying out weighted calculation on the signal-to-noise ratio parameter matrix through a wiener filter transfer function to obtain a self-adaptive filter coefficient matrix;
performing complex matrix multiplication operation on the coefficient matrix of the adaptive filter and the reconstructed spectrum data to obtain a filtered voice spectrum, dividing the filtered voice spectrum according to sub-bands, and calculating the energy mean value and variance of each sub-band to obtain dynamic gain control parameters;
Performing nonlinear mapping operation and signal amplitude compensation on the dynamic gain control parameters and the filtered voice spectrum to obtain a voice spectrum after gain compensation, and performing inverse short-time Fourier transform and windowing overlap addition on the voice spectrum after gain compensation to obtain an enhanced voice signal;
Extracting the mel frequency cepstrum coefficient features of the enhanced voice signal, and inputting the mel frequency cepstrum coefficient features into the deep neural network model for voice recognition to obtain a voice recognition result.
Specifically, power spectrum calculation is performed on each time-frequency point of the reconstructed spectrum data, and the expression of the power spectrum P(f,t) is:
P(f,t) = |X(f,t)|²;
Wherein X(f,t) is the complex spectral value at frequency f and time t, including amplitude and phase information; |X(f,t)| is the amplitude of the signal; P(f,t) is the power value corresponding to the time-frequency point, reflecting the energy distribution of the signal. A moving average process is performed on the power spectrum P(f,t) to smooth out short-time fluctuations and extract stable characteristics. The calculation formula of the moving average is:
P̄(f,t) = (1/(2W+1))·Σ_{m=−W}^{W} P(f, t+m);
Wherein P̄(f,t) is the smoothed power spectrum matrix, W is the width of the sliding window, controlling the degree of smoothness, and m represents the index of adjacent frames on the time axis. The moving average process can eliminate high-frequency noise in the power spectrum, resulting in a smoother energy distribution. Based on the smoothed power spectrum matrix P̄, the background noise power spectrum of each frequency band is estimated by a statistical method, using minimum tracking or recursive averaging, wherein the estimation formula of the background noise power spectrum N(f,t) is:
N(f,t) = min_{t′∈[t−T_w, t]} P̄(f,t′);
Wherein N(f,t) is the noise power value at frequency f, and the stable level of the background noise is estimated through minimum value tracking over a window of length T_w on the time axis. According to the smoothed power spectrum matrix P̄ and the background noise power spectrum N, the a priori signal-to-noise ratio ξ and the a posteriori signal-to-noise ratio γ are calculated. The calculation formula of the a posteriori signal-to-noise ratio is:
γ(f,t) = P̄(f,t) / N(f,t);
Wherein γ(f,t) indicates the a posteriori signal-to-noise ratio of the signal at frequency f and time t. The a priori signal-to-noise ratio is estimated by a recursive formula, wherein the formula is:
ξ(f,t) = α_s·ξ(f,t−1) + (1 − α_s)·max(γ(f,t) − 1, 0);
Wherein ξ(f,t) is the a priori signal-to-noise ratio and α_s is a smoothing factor that controls the update rate of the a priori signal-to-noise ratio. Using the a priori and a posteriori signal-to-noise ratios, the adaptive filter coefficient matrix is calculated through the transfer function of the wiener filter. The transfer function formula of the wiener filter is:
G(f,t) = ξ(f,t) / (1 + ξ(f,t));
Wherein G(f,t) is the gain of the wiener filter, expressing the degree of enhancement at frequency f and time t. The adaptive filter coefficient matrix G and the reconstructed spectrum data X_rec are multiplied point by point in the complex domain to obtain the filtered speech spectrum X_filt, with the formula:
X_filt(f,t) = G(f,t)·X_rec(f,t);
The filtered speech spectrum retains the speech components of high signal-to-noise ratio and suppresses the noise components of low signal-to-noise ratio. The filtered speech spectrum X_filt is analyzed according to the sub-band division, and the energy mean value and variance of each sub-band are calculated. Let the set of frequency points of the k-th sub-band be B_k; the formulas for its energy mean μ_k and variance σ_k² are:
μ_k = (1/|B_k|)·Σ_{f∈B_k} |X_filt(f,t)|²,  σ_k² = (1/|B_k|)·Σ_{f∈B_k} (|X_filt(f,t)|² − μ_k)²;
Wherein μ_k is the mean value of the sub-band energy, σ_k² is the variance of the sub-band energy, and f ranges over the frequency points in sub-band B_k. The dynamic gain control parameter g_k is calculated using the energy mean and variance, with the formula:
g_k = μ_k / (σ_k + ε);
Wherein ε is a small positive value to avoid a zero denominator. The dynamic gain control parameter g_k is applied to the filtered speech spectrum, and nonlinear mapping operation and signal amplitude compensation are performed to obtain the gain-compensated speech spectrum. The nonlinear mapping takes the form:
X_gain(f,t) = Φ(g_k)·X_filt(f,t);
Wherein Φ(·) is a saturating nonlinear mapping function (for example a hyperbolic tangent) that bounds the compensation gain.
An inverse short-time Fourier transform is performed on the gain-compensated voice spectrum to restore it to the time domain, and a windowed overlap-add operation is applied to obtain the enhanced voice signal. The mel frequency cepstrum coefficient features are then extracted from the enhanced voice signal and input into the deep neural network model for voice recognition, finally obtaining the voice recognition result.
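The wiener filtering chain above (posterior SNR, recursive a priori SNR, gain ξ/(1+ξ)) can be sketched as follows, assuming the decision-directed form of the recursion; the function and parameter names are illustrative:

```python
import numpy as np

def wiener_enhance(X, noise_psd, alpha=0.98):
    """Frame-by-frame wiener filtering sketch: posterior SNR gamma from the
    power spectrum, decision-directed a priori SNR xi, gain xi/(1+xi)
    applied to the complex spectrum."""
    power = np.abs(X) ** 2
    gains = np.empty_like(power)
    xi_prev = np.ones(X.shape[0])                    # initial a priori SNR
    for t in range(X.shape[1]):
        gamma = power[:, t] / noise_psd              # a posteriori SNR
        xi = alpha * xi_prev + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
        gains[:, t] = xi / (1.0 + xi)                # wiener transfer function
        xi_prev = gains[:, t] ** 2 * gamma           # decision-directed update
    return gains * X                                 # filtered speech spectrum
```

High-SNR bins converge to a gain near 1 while noise-level bins are driven toward 0, which is the retention/suppression behavior described above.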
The method further comprises: carrying out pitch tracking analysis on the enhanced voice signal and calculating a voice fundamental frequency track; carrying out harmonic decomposition on the signal based on the voice fundamental frequency track to obtain a harmonic component set; carrying out self-adaptive bandwidth adjustment on the harmonic component set based on the vehicle running speed, and filtering each harmonic component through a speed-related band-pass filter bank to obtain a speed-adaptive harmonic matrix; carrying out envelope detection on the speed-adaptive harmonic matrix and establishing an envelope modulation function in combination with vehicle speed information to obtain a dynamic modulation coefficient; constructing an adaptive gain control matrix based on the dynamic modulation coefficient and carrying out nonlinear amplitude compensation on the speed-adaptive harmonic matrix to obtain an enhanced harmonic signal; adopting a multi-resolution analysis method to adaptively adjust the time-frequency resolution of the enhanced harmonic signal based on the vehicle speed to obtain a resampled voice signal; carrying out pre-compensation processing on the resampled voice signal based on the vehicle speed and road condition information to obtain a pre-processed voice signal; grading the pre-processed voice signal through a voice quality evaluation function and dynamically adjusting the gain control parameters according to the evaluation result to obtain optimized control parameters; and feeding the optimized control parameters back into the enhancement processing of the voice signal to obtain the final enhanced voice signal.
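As one illustrative sketch of the pitch-tracking step named above, a basic autocorrelation fundamental-frequency estimator per frame is shown; the speed-adaptive filtering and modulation steps are patent-specific and not reproduced, and f_lo/f_hi are assumed search bounds rather than values from the embodiment:

```python
import numpy as np

def pitch_track(frame, sr, f_lo=60.0, f_hi=400.0):
    """Estimate the fundamental frequency of one frame by locating the
    strongest autocorrelation peak within the candidate lag range."""
    x = frame - frame.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation
    lo, hi = int(sr / f_hi), int(sr / f_lo)             # candidate lags
    lag = lo + np.argmax(ac[lo:hi])                     # period in samples
    return sr / lag                                     # estimated F0 in Hz
```

Running this per frame yields the fundamental frequency track from which the harmonic components at multiples of F0 can then be decomposed.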
In a specific embodiment, the process of performing the steps of extracting the mel frequency cepstrum coefficient features of the enhanced speech signal and inputting them into the deep neural network model for speech recognition to obtain the speech recognition result may specifically include the following steps:
Performing 512-point fast Fourier transform on the enhanced voice signal, and calculating energy of each frequency band based on 26 triangular Mel filter groups to obtain Mel frequency band energy spectrum;
carrying out logarithmic operation and discrete cosine transformation on the mel frequency band energy spectrum, extracting the first 13 coefficients as static characteristics, and calculating first-order difference and second-order difference characteristics to obtain mel frequency cepstrum coefficient characteristics;
Inputting the mel frequency cepstrum coefficient characteristics into a two-way long short-time memory layer of the deep neural network model, wherein the two-way long short-time memory layer comprises 256 memory units, and performing time sequence modeling on the voice characteristics to obtain context coding characteristics;
Inputting the context coding features into a self-attention layer of the deep neural network model, wherein the self-attention layer comprises 8 attention heads, the dimension of each attention head is 32, and obtaining weighted context features through dot product attention calculation;
Inputting the weighted context characteristics into a 3-layer feedforward neural network of a deep neural network model for processing, wherein each layer of feedforward neural network comprises 512 neurons and a ReLU activation function, and adding residual connection between layers to obtain high-level semantic characteristics;
And performing sequence labeling on the high-level semantic features, calculating state transition probability based on the dependency relationship of the front frame and the rear frame to obtain a phoneme recognition sequence, performing decoding processing on the phoneme recognition sequence through a language model, and performing constraint based on a dictionary and grammar rules to obtain a voice recognition result.
In particular, for enhanced speech signalsShort-time framing is performed, and each frame contains 512 sampling points. For each frame of signal, a fast Fourier transform is applied to transform the time domain signal into the frequency domain to obtain a spectral representation. The calculation formula is as follows:
;
Wherein the method comprises the steps of Is a frequency index with a value range of(Corresponding to 256 positive frequency components),Is the firstThe time-domain magnitudes of the individual sample points,Is the frequencyIs a complex spectrum value of (a). Amplitude spectrum calculated based on fast Fourier transformCalculating energy distribution of each frequency band through 26 triangular Mel filter groups to generate Mel frequency band energy spectrum. The Mel filter is divided according to Mel frequency scale, and the calculation formula of Mel frequency is:
;
Wherein the method comprises the steps of Is a linear frequency of the signal at which,Is the corresponding mel frequency. The frequency range of the filter is defined by a linear frequencyAndAnd (5) determining. First, theThe energy calculation formula of each filter is as follows:
;
Wherein the method comprises the steps of Is a mel filterIs used for the energy value of (a),Is the firstThe weight function of the individual filters is such that,AndIs the frequency range of the filter. For mel-band energy spectrumAnd carrying out logarithmic operation to compress the dynamic range so as to enable the dynamic range to be more in line with the perception characteristics of human ears. The formula for the logarithmic energy spectrum is:
;
Wherein the method comprises the steps of Is the value of the logarithmic energy,Is a small constant that prevents zero from occurring in the logarithmic operation. For logarithmic energy spectrumDiscrete cosine transform is applied to extract low frequency components as speech features and generate Mel Frequency Cepstrum Coefficients (MFCCs). The discrete cosine transform formula is:
;
Wherein MFCC Is the firstThe first 13 coefficients are usually extracted as static features from the cepstral coefficientsThese low-order coefficients contain the main speech information. To enhance dynamic characteristics, first-order differential and second-order differential features are calculated for static features. The first order difference is calculated by the following formula:
;
Wherein the method comprises the steps of Is the firstA first order differential feature is provided that,Is the length of the calculation window. The formulas of the second-order difference features are similar, and the first-order difference features are used for replacing static features to perform calculation. A complete sequence of MFCC features including static features, first order differences and second order differences is input into a bi-directional long short term memory layer (BiLSTM) of the deep neural network model. BiLSTM contain 256 memory units that are capable of capturing long-term context information from bi-directional modeling. The state update formula of the memory cell is:
;
Wherein h_t is the hidden state at time t, x_t is the current input feature, and h_{t−1} is the hidden state of the previous moment. Bidirectional processing considers information from past and future frames simultaneously, generating context-encoded features. The context-encoded features h_t are input to the self-attention layer, which contains 8 attention heads, each of dimension 32. The attention mechanism calculates the weights by scaled dot product, with the formula:
Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V;
Wherein Q, K and V are the query, key and value matrices respectively, and d_k is the dimension of the key vector. Weighted contextual features are derived through the attention mechanism to capture important relationships in the sequence. The weighted contextual features are input into a 3-layer feed-forward network of the deep neural network, each layer containing 512 neurons and a ReLU activation function; the residual connections of the feed-forward network improve training stability and model expressive capability. The output of the feed-forward network is a set of high-level semantic features. Sequence labeling is performed on the high-level semantic features, state transition probabilities are calculated from the dependency relationship between preceding and following frames, and a phoneme recognition sequence is generated. The phoneme sequence is decoded with a language model, and a dictionary and grammar rules are used as constraints to generate the speech recognition result.
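The scaled dot-product attention step described above can be sketched as follows. This is a minimal NumPy illustration, not the embodiment's implementation; the sizes (8 heads of dimension 32 over a 256-dimensional feature sequence) follow the figures stated in the text, and the projection matrices Wq, Wk, Wv are hypothetical random placeholders:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise over the sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T, T) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# 8 heads of dimension 32 over 256-d BiLSTM features, as in the text.
rng = np.random.default_rng(0)
T, d_model, heads, d_head = 10, 256, 8, 32
X = rng.standard_normal((T, d_model))            # context-encoded features
Wq = rng.standard_normal((heads, d_model, d_head)) * 0.05
Wk = rng.standard_normal((heads, d_model, d_head)) * 0.05
Wv = rng.standard_normal((heads, d_model, d_head)) * 0.05
out = np.concatenate(
    [scaled_dot_product_attention(X @ Wq[h], X @ Wk[h], X @ Wv[h])
     for h in range(heads)], axis=-1)            # concat heads -> (T, 256)
```

Each head attends over the whole sequence independently, and the concatenated heads restore the 256-dimensional feature width expected by the feed-forward layers.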
The embodiment further comprises performing noise-characteristic adaptive learning and dynamic correction of voice characteristics on the voice recognition result. Specifically, subspace analysis is performed on the low-rank matrix representing the noise components, and the principal directions of the noise characteristics are extracted by principal component analysis to obtain a noise characteristic subspace. The noise characteristic subspace is sorted in descending order of energy contribution rate, and the eigenvectors whose cumulative contribution rate exceeds 95% are selected to obtain a key noise characteristic basis. An online noise dictionary is constructed from the key noise characteristic basis, and projection analysis is performed on newly input voice signals to obtain a noise pattern mapping matrix. Sparse coding is applied to the noise pattern mapping matrix, and the sparse representation coefficients are calculated by an orthogonal matching pursuit algorithm to obtain a noise characteristic code. The noise characteristic code is time-frequency aligned with the voice recognition result to establish a noise-voice characteristic mapping relation and obtain a characteristic correction template. Based on the characteristic correction template, the predicted voice characteristic data are adaptively adjusted and the characteristic values are updated by a weighted average method to obtain corrected voice characteristics. Spectrum reconstruction and voice recognition are then performed on the corrected voice characteristics to obtain an optimized voice recognition result, which is fed back to update the noise characteristic basis and the characteristic correction template, thereby realizing adaptive cyclic correction.
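Two steps in this paragraph have standard numerical forms: selecting eigenvectors whose cumulative energy contribution exceeds 95%, and computing sparse coefficients by orthogonal matching pursuit. The sketch below illustrates both under assumed data shapes; the function names are hypothetical and not part of the embodiment:

```python
import numpy as np

def key_noise_basis(noise_frames, threshold=0.95):
    """PCA of noise frames: keep the leading principal directions whose
    cumulative energy contribution exceeds `threshold` (95% in the text)."""
    X = noise_frames - noise_frames.mean(axis=0)
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    contrib = np.cumsum(s**2) / np.sum(s**2)      # cumulative contribution rate
    k = int(np.searchsorted(contrib, threshold)) + 1
    return Vt[:k].T                               # columns = key noise basis

def omp(D, y, n_nonzero):
    """Orthogonal matching pursuit: sparse x with D @ x ~= y."""
    residual, support = y.astype(float), []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        # pick the dictionary atom most correlated with the residual
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        # re-fit all selected atoms jointly (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x
```

In use, `key_noise_basis` would be applied to frames of the low-rank noise matrix, and `omp` would encode the projection of a new signal against the online noise dictionary.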
The above describes the audio processing method for speech spectrum reconstruction in the embodiment of the present invention, and the following describes the audio processing system for speech spectrum reconstruction in the embodiment of the present invention, referring to fig. 2, an embodiment of the audio processing system for speech spectrum reconstruction in the embodiment of the present invention includes:
The preprocessing module is used for preprocessing the original voice signals acquired by the vehicle-mounted microphone to obtain a voice signal time-frequency matrix;
the singular value decomposition module is used for carrying out singular value decomposition on the voice signal time-frequency matrix to obtain a sparse matrix representing voice components and a low-rank matrix representing noise components, and establishing a frequency band characteristic association matrix through a mutual information entropy calculation method;
the prediction module is used for inputting the sparse matrix into the selective forgetting extreme learning machine to predict the voice characteristics so as to obtain predicted voice characteristic data;
The computing module is used for performing voice probability computation and feature interpolation on each time-frequency point by a logarithmic spectral amplitude calculation method, based on the predicted voice feature data and the frequency band characteristic association matrix, to obtain the reconstructed spectrum data.
Through the collaborative cooperation of the above components, voice characteristics are predicted, and time-frequency points are precisely reconstructed by combining logarithmic spectral amplitude calculation with a feature interpolation technique, effectively maintaining the continuity and naturalness of the voice signal. An adaptive Wiener filtering and dynamic gain control strategy is designed to dynamically adjust the filter parameters and gain coefficients according to the signal-to-noise ratio, significantly improving the voice enhancement effect. A deep neural network model is adopted for voice recognition; the bidirectional long short-term memory network and the self-attention mechanism effectively extract the long- and short-term dependencies of the voice characteristics, improving the accuracy of voice recognition while meeting the real-time processing requirements of the vehicle-mounted environment.
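The SNR-dependent gain idea behind adaptive Wiener filtering can be sketched as follows. This is a minimal illustration of the classical Wiener gain rule, not the embodiment's filter design; the gain floor `g_min` and the constant SNR estimate are illustrative assumptions:

```python
import numpy as np

def adaptive_wiener_gain(snr_linear, g_min=0.1):
    """Classical Wiener gain G = SNR / (1 + SNR) per time-frequency bin,
    floored at g_min to limit over-suppression (musical noise)."""
    snr_linear = np.asarray(snr_linear, dtype=float)
    return np.maximum(snr_linear / (1.0 + snr_linear), g_min)

# Applying the gain to a magnitude spectrogram under an assumed SNR estimate.
rng = np.random.default_rng(1)
mag = np.abs(rng.standard_normal((257, 100)))   # |STFT| of a noisy signal
snr = np.full_like(mag, 4.0)                    # hypothetical prior SNR
enhanced = adaptive_wiener_gain(snr) * mag      # gain 4/5 = 0.8 per bin
```

Bins with high estimated SNR pass nearly unchanged, while low-SNR bins are attenuated toward the floor, which matches the dynamic gain adjustment described above.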
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described system and units may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied, in essence or in whole or in part, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
While the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit and scope of the embodiments of the invention.