Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in the present disclosure merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate that A exists alone, that A and B both exist, or that B exists alone. The character "/" in the present disclosure generally indicates an "or" relationship between the objects before and after it.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
It should also be understood that, for convenience of description, the sizes of the parts shown in the drawings are not drawn to actual scale.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure are applicable to electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with such an electronic device include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the disclosure
In carrying out the present disclosure, the inventors found that, in the existing block frequency domain Kalman filter (FKF) technique, the frequency domain filter estimation parameter $\hat{W}_b(k+1)$ corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame can be expressed in the form of the following formula (1):
In the above formula (1), $A$ represents the uncertainty of the acoustic echo path and is a constant, for example, $A$ may be 1; $\hat{W}_b(k)$ represents the frequency domain filter estimation parameter (uppercase W denotes the frequency domain) corresponding to the b-th speech frequency domain block in the k-th speech frame; $G_{L,0}$ represents the first time domain constraint matrix; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal (uppercase X denotes the frequency domain) of the b-th speech frequency domain block in the k-th speech frame; and $E(k)$ represents the frequency domain error signal (uppercase E denotes the frequency domain) of the k-th speech frame.
In order to analyze the steady-state solution of the block frequency domain Kalman filter technique, the inventors assume that $A = 1$, multiply both sides of the above formula (1) by the inverse discrete Fourier transform matrix $F^{-1}$, whose size is the frame length by the frame length, and rearrange the result, which yields the following formula (2):
In the above formula (2), $M_b(k) = F\mu_b(k)F^{-1}$; $X_{C,b}(k) = F X_b(k) F^{-1}$; $L$ denotes the length of each speech frequency domain block in the speech frame; $I_L$ denotes an identity matrix of size $L \times L$; $e(k)$ denotes the time domain error signal of the k-th speech frame; and the subscript $C$ denotes a circulant matrix, i.e., $X_{C,b}(k)$ is a circulant matrix.
$M_b(k)$ and $X_{C,b}(k)$ in the above formula (2) are circulant matrices, and they can be expressed in the form of the following formula (3), respectively:
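The circulant property stated above can be checked numerically. The following is a minimal sketch, assuming the step size matrix is diagonal and is stored as a vector; the function names and the sample values are illustrative and are not part of the disclosure.

```python
import numpy as np

def to_time_domain_matrix(mu_diag):
    """Return F @ diag(mu_diag) @ F^{-1}, where F is the DFT matrix."""
    n = len(mu_diag)
    F = np.fft.fft(np.eye(n))  # n x n DFT matrix
    return F @ np.diag(mu_diag) @ np.linalg.inv(F)

def is_circulant(mat, tol=1e-9):
    """True if shifting rows and columns by one position leaves mat unchanged."""
    shifted = np.roll(np.roll(mat, 1, axis=0), 1, axis=1)
    return np.allclose(mat, shifted, atol=tol)

# Hypothetical diagonal of a frequency domain step size matrix.
mu_diag = np.array([0.9, 0.5, 0.3, 0.7])
M_b = to_time_domain_matrix(mu_diag)
```

Each entry of `M_b` depends only on the difference between its row and column indices, which is what allows the matrix to be split into the sub-blocks used in the later formulas.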
$e(k)$ in the above formula (2) can be expressed in the form of the following formula (4):
In the above formula (4), $e(k)$ represents the time domain error signal of the k-th speech frame (lowercase e denotes the time domain); $y(k)$ represents the time domain desired signal of the k-th speech frame (lowercase y denotes the time domain); $\hat{w}_b(k)$ represents the time domain filter estimation parameter (lowercase w denotes the time domain) corresponding to the b-th speech frequency domain block in the k-th speech frame, which may be regarded as the equivalent time domain form of the frequency domain filter estimation parameter; and $B$ represents the total number of speech frequency domain blocks contained in one speech frame.
Taking the mathematical expectation of both sides of the above formula (2), the following formula (5) can be obtained:
In the above formula (5), $E\{x\}$ represents the mathematical expectation of $x$; $\hat{w}_b(k+1)$ represents the time domain filter estimation parameter corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $\hat{w}_b(k)$ represents the time domain filter estimation parameter (lowercase w denotes the time domain) corresponding to the b-th speech frequency domain block in the k-th speech frame; $\Lambda_{b,1}(k)$, $\Lambda_{b,2}(k)$, $R_{b,m}$, and $r_b$ can be expressed in the form of the following formula (6); and $B$ represents the number of speech frequency domain blocks contained in one speech frame.
$r_b = E\{X_{C,b,2}(k)\,y(k)\}$
In the formula (6), $E\{\cdot\}$ denotes the mathematical expectation; the subscript $C$ denotes a circulant matrix; $X_{C,b,1}(k)$ and $X_{C,b,2}(k)$ are each a sub-matrix of $X_{C,b}(k)$, as shown in the formula (3); $X_{C,m,2}^{T}(k)$ represents the transpose of $X_{C,m,2}(k)$; $y(k)$ represents the time domain desired signal of the k-th speech frame; $M_{b,1}(k)$ and $M_{b,2}(k)$ are each a sub-matrix of $M_b(k)$, as shown in the above formula (3); $\Lambda_{b,1}(k)$ represents the mathematical expectation of $M_{b,1}(k)$; and $\Lambda_{b,2}(k)$ represents the mathematical expectation of $M_{b,2}(k)$.
From the above formula (5) and formula (6), it can be seen that the steady-state solution of the filter coefficients in the existing block frequency domain Kalman filter technique can be expressed as the following formula (7):
In the above formula (7), $b$ denotes the b-th speech frequency domain block in a speech frame; $\hat{w}_b$ represents the time domain filter estimation parameter (lowercase w denotes the time domain) corresponding to the b-th speech frequency domain block in the speech frame; $\hat{w}_b(\infty)$ represents the steady-state solution of the time domain filter estimation parameter corresponding to the b-th speech frequency domain block in the speech frame; $\Lambda_{b,1}(\infty)$, $\Lambda_{b,2}(\infty)$, $R_{b,m}$, $R_{b,b}$, and $r_b$ can be expressed in the form of the following formula (2); and $B$ represents the number of speech frequency domain blocks contained in one speech frame.
Assuming that the actual adaptive filter order is $N'$ and the order of the unknown target system filter to be matched is $N$, if $N' < N$, the signal $y(k)$ output by the unknown target system filter to be matched can be expressed in the form of the following formula (8):
In the above formula (8), $s(k)$ represents the background noise of the k-th speech frame; $X_{C,b,1}^{T}(k)$ represents the transpose of $X_{C,b,1}(k)$, where $X_{C,b,1}(k)$ is a sub-matrix of $X_{C,b}(k)$, as shown in the above formula (3); and $w_{o,m}$ represents the m-th block (lowercase w denotes the time domain) of the optimal solution of the time domain filter coefficients.
Substituting the above formula (8) into the above formula (7) and rearranging, the following formula (9) can be obtained:
As can be seen from the above formula (9), when the acoustic echo path uncertainty $A$ is 1 and the adaptive filter order is sufficient, formula (9) shows that the steady-state solution of the block frequency domain Kalman filter coefficients equals the optimal solution. When the adaptive filter order is insufficient, however, the above formula (8) does not hold, so the above formula (9) cannot be obtained, and the above formula (7) does not let the frequency domain Kalman filter coefficients converge to the optimal solution. In other words, the steady-state solution of the frequency domain Kalman filter coefficients is then inconsistent with the optimal solution.
Exemplary overview
In applications such as live conferencing, teleconferencing, and voice interaction, the disclosed voice filtering techniques may be used to achieve echo cancellation, active noise control, channel equalization, and the like.
An example is shown in Fig. 1. The microphone 101 is provided on the platform 100. The microphone 101 is connected to the data processing device 102, and the data processing device 102 may be connected to at least one loudspeaker. Only one loudspeaker 103 is shown schematically in Fig. 1, and the connection between the data processing device 102 and the loudspeaker 103 may be wireless.
Assume that a speaker is talking in front of the microphone 101. The speaker's voice, mixed with echo, is collected by the microphone 101, yielding an audio signal in the time domain. The data processing device 102 converts the acquired time domain audio signal into speech frames in the frequency domain and processes each speech frame separately with the speech filtering technique provided by the present disclosure (this may be referred to as a dereverberation process) to remove the echo in each speech frame. The data processing device 102 may then perform speech separation or similar processing on the processed speech frames to obtain a separated sound source signal of the speaker, form an output signal from that sound source signal, and play it through the loudspeaker 103. In this application scenario, the data processing device 102 can effectively avoid playing back the collected echo together with the speech, which improves the clarity of the speaker's voice and allows the participants in the live conference to hear the speaker clearly.
In addition, the data processing device 102 may transmit the obtained sound source signal of the speaker to a device in a different place conference site (such as a data processing device in the different place conference site) in real time through a network, and the device in the different place conference site performs playing processing according to the received sound source signal, so as to realize the teleconference.
Another example is shown in Fig. 2. A microphone may be provided in the portable translation device 200 (e.g., a smartphone). The translation device 200 is used to implement bilingual translation.
During a conversation between user 201 and user 202, user 201 places the translation device 200 in an operational state for bi-directional translation between a first language (e.g., Chinese) and a second language (e.g., English).
The translation device 200 converts the time domain audio signal acquired in real time by its microphone into speech frames in the frequency domain. The translation device 200 may process each speech frame with the speech filtering technique provided by the present disclosure (this may be referred to as noise suppression processing) to remove the background noise in the speech frame, which helps prevent the background noise from affecting subsequent speech recognition and helps improve the clarity of the current speaker's voice. The translation device 200 may then perform speech separation, speech recognition, and similar processing on the processed speech frames, and may determine the language used by the current speaker and the content of the speech from the speech recognition result. Finally, the translation device 200 may convert the content of the current speaker's speech into the other language and output it. For example, the translation device 200 may display the converted content on its display screen, or play it through its loudspeaker.
Repeating the above-described operations of collecting an audio signal, removing background noise, separating speech, recognizing speech, converting speech, and the like can help the user 201 and the user 202 achieve a continuous dialogue.
Exemplary method
Fig. 3 is a schematic structural diagram of one embodiment of a speech filtering method of the present disclosure. As shown in fig. 3, the method of this embodiment includes steps S300, S301, and S302. The steps are described separately below.
S300, acquiring frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame adjacent to the current voice frame.
The current speech frame and the previous speech frame in the present disclosure are both speech frames in the frequency domain. They may be obtained by applying a transform (such as a Fourier transform) to a time domain signal acquired by an audio acquisition device. The previous speech frame refers to the speech frame that is adjacent to, and precedes, the current speech frame in time sequence. The current speech frame and the previous speech frame each include a plurality of speech frequency domain blocks, and both contain the same number of blocks. That number is typically related to the filter block length. The speech frequency domain blocks in the present disclosure may refer to the blocks in a partitioned block frequency domain Kalman filter (PFKF) algorithm.
The frequency domain filtering parameters in this disclosure may refer to parameters used by the filter. The frequency domain filtering parameters may be referred to as frequency domain filtering algorithm parameters. A speech frequency domain block typically corresponds to a set of frequency domain filtering parameters. Different speech frequency domain blocks typically correspond to different sets of frequency domain filtering parameters. A set of frequency domain filtering parameters typically includes a plurality of parameters.
S301, determining frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame according to the frequency domain filtering parameters corresponding to each voice frequency domain block in the previous voice frame and two first time domain constraint matrixes.
The first time domain constraint matrix in the present disclosure may refer to a matrix that converts frequency domain filtering parameters into the time domain, constrains them in the time domain, and then converts the constrained parameters back into the frequency domain. Constraining the filtering parameters in the time domain may mean limiting the values of some of their elements, for example, setting to zero the elements that lie beyond the preset filter length. The frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame may refer to the frequency domain filter coefficients corresponding to each speech frequency domain block in the current speech frame. The present disclosure may apply a frequency domain filtering algorithm (such as an iterative frequency domain filter parameter algorithm based on block frequency domain Kalman filtering) to the frequency domain filtering parameters corresponding to each speech frequency domain block in the previous speech frame and the two first time domain constraint matrices, thereby obtaining the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame. When the current speech frame becomes the previous speech frame in the next iteration, these frequency domain filter estimation parameters become part of the frequency domain filtering parameters corresponding to the speech frequency domain blocks in the previous speech frame.
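The constraint described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the constraint keeps the first L time domain taps and zeroes the rest, and the names `constrain_time_domain` and `constraint_matrix`, as well as the frame and block lengths, are hypothetical.

```python
import numpy as np

def constrain_time_domain(W, L):
    """Zero the time domain taps beyond L and return to the frequency domain."""
    w = np.fft.ifft(W)    # frequency domain -> time domain
    w[L:] = 0.0           # keep the first L taps, zero the rest
    return np.fft.fft(w)  # time domain -> frequency domain

def constraint_matrix(M, L):
    """Explicit M x M matrix with the same effect as constrain_time_domain."""
    F = np.fft.fft(np.eye(M))  # DFT matrix
    window = np.diag(np.r_[np.ones(L), np.zeros(M - L)])
    return F @ window @ np.linalg.inv(F)

M, L = 8, 4  # hypothetical frame length and block length
rng = np.random.default_rng(0)
W = rng.standard_normal(M) + 1j * rng.standard_normal(M)
W_c = constrain_time_domain(W, L)
```

Applying the FFT pair directly avoids forming the M x M matrix, which is why implementations usually prefer the first function; the matrix form is shown only to connect the operation with the matrix notation used in the text.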
S302, filtering the current voice frame according to the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame to obtain a frequency domain error signal of the current voice frame.
The filtering process in the present disclosure may include obtaining the difference between the actual desired signal in the frequency domain and the estimated desired signal in the frequency domain. The actual desired signal may be obtained from external input information. When the current speech frame becomes the previous speech frame in the next iteration, the frequency domain error signal obtained for the current speech frame becomes part of the frequency domain filtering parameters corresponding to each speech frequency domain block in the previous speech frame.
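The difference described above can be illustrated as follows. This is a simplified sketch under stated assumptions: each block's frequency domain reference signal is treated as a diagonal matrix stored as a vector, the estimated desired signal is taken to be the sum over blocks of reference times filter estimate, and any time domain windowing of the error that a complete implementation might apply is omitted. All names and dimensions are hypothetical.

```python
import numpy as np

def frequency_domain_error(Y, X_blocks, W_blocks):
    """E = Y - sum_b X_b * W_b, with each X_b stored as a diagonal (vector)."""
    Y_est = np.sum(X_blocks * W_blocks, axis=0)  # estimated desired signal
    return Y - Y_est

B, M = 3, 8  # hypothetical number of blocks and frame length
rng = np.random.default_rng(1)
X_blocks = rng.standard_normal((B, M)) + 1j * rng.standard_normal((B, M))
W_true = rng.standard_normal((B, M)) + 1j * rng.standard_normal((B, M))
Y = np.sum(X_blocks * W_true, axis=0)  # noise-free desired signal

E_matched = frequency_domain_error(Y, X_blocks, W_true)
E_initial = frequency_domain_error(Y, X_blocks, np.zeros_like(W_true))
```

With a perfectly matched filter the error vanishes, and with an all-zero filter the error equals the desired signal itself.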
In the present disclosure, two time domain constraint matrices are used when obtaining the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame. This helps the filtering algorithm converge to the optimal solution with essentially no increase in computational complexity, which addresses the problem of insufficient filter order. The technical solution provided by the present disclosure is therefore beneficial for improving speech filtering performance.
In an alternative example, the frequency domain filtering parameters corresponding to each speech frequency domain block in the previous speech frame may include: the frequency domain filter step size matrix corresponding to each speech frequency domain block in the previous speech frame, the frequency domain filter estimation parameter corresponding to each speech frequency domain block in the previous speech frame, the frequency domain reference signal of each speech frequency domain block in the previous speech frame, and the frequency domain error signal of the previous speech frame. The frequency domain reference signal may be obtained from externally input information. The frequency domain filter step size matrix may also be referred to as the equivalent step size of the frequency domain filter. The step size matrices, the filter estimation parameters, and the error signal of the previous speech frame may be obtained by iterative computation: in the next iteration, the current speech frame becomes the previous speech frame, and the step size matrices, filter estimation parameters, and frequency domain error signal computed for the current speech frame in this iteration become those of the previous speech frame.
For the technical solution of the present disclosure, the frequency domain reference signal of each voice frequency domain block in the previous voice frame may be considered as a known signal. For example, the present disclosure may obtain frequency domain reference signals for each speech frequency domain block in a previous speech frame from external input information.
By utilizing the parameters, the frequency domain filter estimation parameters corresponding to each voice frequency domain block in the current voice frame can be conveniently obtained, so that the convergence of the filtering algorithm to the optimal solution is facilitated.
In an alternative example, the frequency domain filter step size matrix corresponding to each speech frequency domain block in the previous speech frame may be obtained as follows: for the i-th speech frequency domain block in the previous speech frame, the corresponding frequency domain filter step size matrix is determined according to the covariance matrix of the frequency domain error corresponding to the i-th speech frequency domain block in the previous speech frame, the frequency domain reference signal of the i-th speech frequency domain block in the previous speech frame, the autocorrelation matrix of the background noise of the previous speech frame, the frame length, and the block length. For example, the present disclosure may determine the frequency domain filter step size matrix corresponding to each speech frequency domain block in the previous speech frame using the following formula (10):
In the above formula (10), $\mu_b(k+1)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $L$ represents the length of a speech frequency domain block; $M$ represents the length of a speech frame; $B$ represents the total number of speech frequency domain blocks included in one speech frame; $P_b(k)$ represents the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame; and $\Psi_{SS}(k)$ represents the autocorrelation matrix of the background noise of the k-th speech frame.
It should be specifically noted that μ b (k+1) in the above formula (10) may be used as the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the previous speech frame in the present disclosure, that is, the k+1-th speech frame represents the previous speech frame of the current frame, and the k-th speech frame represents the previous speech frame of the previous speech frame.
By using the covariance matrix of the frequency domain error corresponding to the i-th speech frequency domain block in the previous speech frame, the frequency domain reference signal of that block, the autocorrelation matrix of the background noise of the previous speech frame, the frame length, and the block length, the present disclosure can conveniently obtain the frequency domain filter step size matrix corresponding to each speech frequency domain block in the previous speech frame, which helps the filtering algorithm converge to the optimal solution.
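As a rough illustration of how the listed quantities could combine into a step size, the sketch below uses a form common in the frequency domain Kalman filter literature. It is an assumption rather than the formula of the present disclosure: all matrices are taken to be diagonal and stored as vectors, and the step size of each block is its error covariance divided by an innovation term that sums the contributions of all blocks, scaled by L/M, plus the background noise power. The function name and values are hypothetical.

```python
import numpy as np

def step_size(P_blocks, X_blocks, psi_ss, L, M):
    """Per-bin step size for every block; all matrices stored as diagonals.

    ASSUMPTION (not taken from the source text):
    mu_b = P_b / ((L / M) * sum_b |X_b|^2 * P_b + psi_ss)
    """
    innovation = (L / M) * np.sum(np.abs(X_blocks) ** 2 * P_blocks, axis=0) + psi_ss
    return P_blocks / innovation

B, M, L = 2, 4, 2             # hypothetical block count, frame length, block length
P_blocks = np.ones((B, M))    # error covariance diagonals
X_blocks = np.ones((B, M))    # reference signal diagonals
psi_ss = np.ones(M)           # background noise power per bin
mu_blocks = step_size(P_blocks, X_blocks, psi_ss, L, M)
```

In this toy case the innovation term is 0.5 * 2 + 1 = 2 in every bin, so every step size equals 0.5. A larger noise power shrinks the step size, which is the qualitative behavior expected of a Kalman gain.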
In an alternative example, the covariance matrix of the frequency domain error corresponding to the i-th speech frequency domain block in the previous speech frame may be determined as follows: for the i-th speech frequency domain block in the previous speech frame, the covariance matrix is determined according to the covariance matrix of the frequency domain error corresponding to the i-th speech frequency domain block in the speech frame preceding the previous speech frame, the frequency domain reference signal of the i-th speech frequency domain block in that preceding frame and its conjugate transpose, the process noise of the i-th speech frequency domain block in that preceding frame, the frequency domain filter step size matrix corresponding to the i-th speech frequency domain block in that preceding frame, an identity matrix whose size is the frame length by the frame length, the frame length, and the block length. Alternatively, the present disclosure may obtain the covariance matrix of the frequency domain error corresponding to the i-th speech frequency domain block in the previous speech frame using the following formula (11):
In the above formula (11), $P_b(k+1)$ represents the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $A$ represents the acoustic echo path uncertainty, for example, $A$ may be 1; $I_M$ represents an identity matrix of size $M \times M$; $L$ represents the length of a speech frequency domain block, i.e., the block length; $M$ represents the length of a speech frame, i.e., the frame length; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame; $X_b^{H}(k)$ represents the conjugate transpose of $X_b(k)$; $P_b(k)$ represents the covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the k-th speech frame; and $\Psi_{b,\Delta}(k)$ represents the process noise of the b-th speech frequency domain block in the k-th speech frame.
It should be specifically noted that P b (k+1) in the above formula (11) may be used as a covariance matrix of the frequency domain error corresponding to the b-th speech frequency domain block in the previous speech frame in the present disclosure, that is, the k+1-th speech frame represents the previous speech frame of the current frame, and the k-th speech frame represents the previous speech frame of the previous speech frame.
By using the covariance matrix of the frequency domain error corresponding to the i-th speech frequency domain block in the speech frame preceding the previous speech frame, the frequency domain reference signal of that block and its conjugate transpose, the process noise of that block, the frequency domain filter step size matrix corresponding to that block, the identity matrix whose size is the frame length by the frame length, the frame length, and the block length, the present disclosure can conveniently obtain the covariance matrix of the frequency domain error corresponding to each speech frequency domain block in the previous speech frame, which helps the filtering algorithm converge to the optimal solution.
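The covariance recursion can be sketched in the same diagonal form. The expression below is an assumption modeled on the standard frequency domain Kalman filter covariance update, chosen because it uses exactly the quantities listed above (A, the identity matrix, L, M, the step size, the reference signal and its conjugate transpose, the previous covariance, and the process noise). It is not a reproduction of the formula of the present disclosure, and the names are hypothetical.

```python
import numpy as np

def update_covariance(P, mu, X, psi_delta, L, M, A=1.0):
    """One covariance step for a single block; matrices stored as diagonals.

    ASSUMPTION (not taken from the source text):
    P(k+1) = A^2 * (1 - (L / M) * mu * |X|^2) * P + psi_delta
    For a diagonal X, conj(X) * X equals |X|^2.
    """
    return A ** 2 * (1.0 - (L / M) * mu * np.abs(X) ** 2) * P + psi_delta

M, L = 4, 2                       # hypothetical frame length and block length
P = np.ones(M)                    # previous error covariance diagonal
mu = 0.5 * np.ones(M)             # step size diagonal
X = np.ones(M, dtype=complex)     # reference signal diagonal
psi_delta = 0.1 * np.ones(M)      # process noise diagonal
P_next = update_covariance(P, mu, X, psi_delta, L, M)
```

Here each bin evaluates to (1 - 0.5 * 0.5 * 1) * 1 + 0.1 = 0.85: the measurement shrinks the covariance and the process noise inflates it again.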
In an alternative example, the process of determining the frequency domain filter estimation parameters corresponding to each of the voice frequency domain blocks in the current voice frame according to the present disclosure may include, first, for the i-th voice frequency domain block of the current voice frame, performing matrix multiplication on a frequency domain filter step size matrix corresponding to the i-th voice frequency domain block of the previous voice frame, a frequency domain reference signal of the previous voice frame, a frequency domain error signal of the previous voice frame, and two first time domain constraint matrices. And secondly, adding the result obtained by multiplying the matrix with the frequency domain filter estimation parameter corresponding to the ith voice frequency domain block of the previous voice frame. And finally, determining a frequency domain filter estimation parameter corresponding to the ith voice frequency domain block in the current voice frame according to the added result and the preset echo path uncertainty. The process of determining the frequency domain filter estimation parameters for each respective speech frequency domain block in the current speech frame of the present disclosure may be expressed in the form of the following equation (12):
In the above-mentioned formula (12), Representing frequency domain filter estimation parameters (here, uppercase W, corresponding to the frequency domain) corresponding to the b-th speech frequency domain block in the k+1-th speech frame; a represents the acoustic echo path uncertainty, for example, the value of a may be 1; representing frequency-domain filter estimation parameters corresponding to the b-th speech frequency-domain block in the kth speech frame, G L,0 representing a first time-domain constraint matrix, μ b (k) representing a frequency-domain filter step size matrix corresponding to the b-th speech frequency-domain block in the kth speech frame, X b (k) representing a frequency-domain reference signal (X uppercase corresponding to the frequency domain) of the b-th speech frequency-domain block in the kth speech frame, and E (k) representing a frequency-domain error signal of the kth speech frame.
In the case where the value of a in the above formula (12) is 1, the above formula (12) may be simplified to the form of the following formula (13):
It should be noted that, in formulas (12) and (13) above, quantities indexed by k+1 correspond to the b-th speech frequency domain block in the current speech frame; that is, the (k+1)-th speech frame represents the current speech frame, and the k-th speech frame represents the previous speech frame of the current speech frame.
By using formula (12) or formula (13), the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame can be conveniently obtained, which helps the filtering algorithm converge to the optimal solution.
In an optional example, the present disclosure may filter the current speech frame according to the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame, so as to obtain the frequency domain error signal of the current speech frame. One example of this process is as follows: first, the frequency domain reference signal of each speech frequency domain block in the current speech frame is multiplied by the frequency domain filter estimation parameter corresponding to that block, to obtain a multiplication result for each speech frequency domain block; second, the multiplication results of the speech frequency domain blocks are accumulated, and the accumulated result is multiplied by the second time domain constraint matrix to obtain the multiplication result corresponding to the current speech frame; finally, the frequency domain error signal of the current speech frame is determined according to the difference between the frequency domain desired signal (namely, the actual frequency domain desired signal) of the current speech frame and the multiplication result corresponding to the current speech frame. The above process can be expressed in the form of the following formulas (14) and (15):
$E(k) = Y(k) - \hat{Y}(k)$    formula (14)
In formula (14) above, $E(k)$ represents the frequency domain error signal of the k-th speech frame, $Y(k)$ represents the frequency domain desired signal of the k-th speech frame, and $\hat{Y}(k)$ represents the estimated frequency domain desired signal, which may also be referred to as the estimate of the frequency domain desired signal. $\hat{Y}(k)$ can be expressed in the form of the following formula (15):
$\hat{Y}(k) = G_{0,L} \sum_{b=0}^{B-1} X_b(k)\,\hat{W}_b(k)$    formula (15)
In formula (15) above, $G_{0,L}$ represents a time domain constraint matrix, namely the second time domain constraint matrix; B represents the total number of speech frequency domain blocks contained in a speech frame; $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame; and $\hat{W}_b(k)$ represents the frequency domain filter estimation parameter corresponding to the b-th speech frequency domain block in the k-th speech frame.
By utilizing the formula (14) and the formula (15), the frequency domain error signal of the current voice frame can be conveniently obtained, so that the filter algorithm can be converged to the optimal solution.
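The filtering step described for formulas (14) and (15) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: the function names are hypothetical, the frame length is taken to be twice the block length, each $X_b(k)$ is treated as a diagonal matrix stored as a vector (so the product becomes an element-wise multiplication), and the second time domain constraint matrix is assumed to keep the last L time domain samples, which the text here does not spell out.

```python
import numpy as np


def second_constraint_matrix(block_len: int) -> np.ndarray:
    """Assumed form of G_{0,L}: F @ diag(0, I_L) @ F^{-1}, frame length 2L."""
    frame_len = 2 * block_len
    dft = np.fft.fft(np.eye(frame_len))       # DFT matrix F
    idft = np.fft.ifft(np.eye(frame_len))     # inverse DFT matrix F^{-1}
    window = np.zeros((frame_len, frame_len))
    window[block_len:, block_len:] = np.eye(block_len)  # keep last L samples
    return dft @ window @ idft


def frequency_domain_error(Y, X_blocks, W_blocks, G_0L):
    """Return (E, Y_hat) for one frame.

    Y        : frequency domain desired signal, shape (2L,)
    X_blocks : reference spectra X_b(k), shape (B, 2L)
    W_blocks : filter estimates for each block, shape (B, 2L)
    """
    # Multiply each block's reference spectrum by its filter estimate,
    # then accumulate over the B blocks.
    accumulated = np.sum(X_blocks * W_blocks, axis=0)
    # Apply the second time domain constraint matrix (formula (15)).
    Y_hat = G_0L @ accumulated
    # Difference between actual and estimated desired signal (formula (14)).
    return Y - Y_hat, Y_hat
```

When the filter estimates equal the zero-padded spectra of the true echo path blocks and Y is built from the true output, the returned error is numerically zero, which is a convenient sanity check for the constraint placement.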
In an alternative example, the first time domain constraint matrix of the present disclosure may be a matrix determined according to a discrete Fourier transform matrix of size frame length by frame length, a four-element matrix, and an inverse discrete Fourier transform matrix of size frame length by frame length. For example, the discrete Fourier transform matrix, the four-element matrix, and the inverse discrete Fourier transform matrix are multiplied, and the first time domain constraint matrix is determined from the resulting product. The upper right, lower right, and lower left elements of the four-element matrix are all zero matrices of size block length by block length, and the upper left element is an identity matrix of size block length by block length. The first time domain constraint matrix may be expressed in the form of the following formula (16):
$G_{L,0} = F \begin{bmatrix} I_L & 0 \\ 0 & 0 \end{bmatrix} F^{-1}$    formula (16)
In formula (16) above, $I_L$ represents an identity matrix of size L×L; L represents the length of a speech frequency domain block, i.e., the block length; F represents the discrete Fourier transform matrix of size frame length by frame length; and $F^{-1}$ represents the inverse discrete Fourier transform matrix of size frame length by frame length.
By using the first time domain constraint matrix shown in the formula (16), the frequency domain error signal of the current voice frame can be conveniently obtained under the condition that the filtering calculation complexity is basically not influenced, so that the filtering algorithm can be converged to the optimal solution.
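The construction of the first time domain constraint matrix described above can be checked numerically with a short NumPy sketch. The function name is hypothetical; the frame length is 2L because the four-element matrix is built from L×L elements.

```python
import numpy as np


def first_constraint_matrix(block_len: int) -> np.ndarray:
    """Build G_{L,0} = F @ [[I_L, 0], [0, 0]] @ F^{-1} for frame length 2L."""
    frame_len = 2 * block_len
    dft = np.fft.fft(np.eye(frame_len))    # DFT matrix F (frame_len x frame_len)
    idft = np.fft.ifft(np.eye(frame_len))  # inverse DFT matrix F^{-1}
    # Four-element matrix: identity in the upper left, zeros elsewhere.
    four_element = np.zeros((frame_len, frame_len))
    four_element[:block_len, :block_len] = np.eye(block_len)
    return dft @ four_element @ idft


if __name__ == "__main__":
    G = first_constraint_matrix(4)
    v = np.arange(8.0)
    # Applying G in the frequency domain keeps the first L time domain
    # samples and zeroes the remaining L samples.
    print(np.fft.ifft(G @ np.fft.fft(v)).real.round(6))
```

In practice the same effect is usually obtained with an inverse FFT, zeroing of the last L samples, and a forward FFT, rather than with an explicit matrix product; the explicit matrix is used here only to mirror formula (16).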
The following analyzes the optimal solution that the block frequency domain Kalman filter coefficients (i.e., the frequency domain filter estimation parameters) of the present disclosure can reach.
By left-multiplying both sides of formula (16) above by $F^{-1}$, then rearranging and taking the expectation, the following formula (17) can be obtained:
The steady-state solution of formula (17) above may be expressed in the form of the following formula (18):
The optimal solution for a filter of order N should generally satisfy the Wiener-Hopf equation shown in the following formula (19):
$R_x w_o = p$    formula (19)
In formula (19) above, $R_x$ represents the N×N autocorrelation matrix of the reference signal; $w_o$ represents the optimal solution of the time domain filter coefficients (lowercase w denotes the time domain); and p represents the cross-correlation vector between the N×1 time domain reference signal vector and the output signal of the unknown target system to be matched.
The following equation (20) can be obtained by splitting the left and right sides of the equal sign of the above equation (19) into B vectors of length L:
In formula (20) above, B represents the number of speech frequency domain blocks contained in one speech frame, and the value range of b is [0, B-1]; $R_{x,b,b}$ and $R_{x,b,m}$ can be expressed by formula (21) below; $R_{x,b,b}^{-1}$ represents the inverse of $R_{x,b,b}$; $p_b$ represents the b-th block of the cross-correlation vector p, i.e., $p_b = [p_{bL}, \ldots, p_{bL+L-1}]$; $w_{o,m}$ represents the optimal solution of the m-th block of the time domain filter coefficients; and L represents the block length.
In formula (21) above, $R_x(i) = E\{x(n)x(n-i)\}$, where $R_x(i)$ represents the autocorrelation function of the reference signal.
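The Wiener-Hopf relation and the autocorrelation function discussed above can be illustrated with a small NumPy sketch. It is only an illustration under assumptions: the function name is hypothetical, the expectations are replaced by biased sample averages, and the reference signal is taken to be real-valued and stationary.

```python
import numpy as np


def wiener_solution(x: np.ndarray, d: np.ndarray, order: int) -> np.ndarray:
    """Solve R_x w_o = p using sample estimates.

    x     : time domain reference signal
    d     : output of the unknown target system driven by x
    order : filter order N
    """
    n = len(x)
    # Sample estimate of R_x(i) = E{x(n) x(n - i)} for lags 0..N-1.
    r = np.array([np.dot(x[i:], x[:n - i]) / n for i in range(order)])
    # Sample estimate of p(i) = E{x(n - i) d(n)}.
    p = np.array([np.dot(d[i:], x[:n - i]) / n for i in range(order)])
    # R_x is Toeplitz: entry (i, j) depends only on |i - j|.
    idx = np.arange(order)
    R = r[np.abs(idx[:, None] - idx[None, :])]
    return np.linalg.solve(R, p)
```

When the order N matches the length of the unknown system, the result approaches the true coefficients; when N is smaller (the insufficient filter order case discussed in this section), the result is the best N-tap approximation in the mean square sense, which is the reference against which the steady-state solution is compared.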
The present disclosure may transform the above formula (6) into the form of the following formula (22):
$r_b = E\{X_{C,0,2}(k-b)\,y(k)\} = L\,p_b$    formula (22)
The present disclosure can obtain the following formula (23) by substituting the above formula (22) into the above formula (18):
As can be seen by comparing formula (20) and formula (23) above, the steady-state solution and the optimal solution of the filtering algorithm in the present disclosure are consistent; that is, the speech filtering technique provided in the present disclosure allows the filtering algorithm to converge to the optimal solution even when the filter order is insufficient.
An implementation flow of one example of the speech filtering method of the present disclosure is shown in fig. 4.
Assume that the order N of the adaptive filter is 10. Assume that the time domain reference signal is obtained by passing zero-mean white noise through a finite impulse response (FIR) filter with time domain filter coefficients [0.1 0.2 -0.4 0.7]; the frequency domain reference signal can then be obtained by performing a discrete Fourier transform on the time domain reference signal. Assume that the time domain desired signal (i.e., the actual desired signal in the time domain) is obtained by filtering the time domain reference signal with a 16-order FIR filter whose coefficients are [0.01 0.02 -0.04 -0.08 0.15 -0.3 0.45 0.6 0.6 0.45 -0.3 0.15 -0.08 -0.04 0.02 -0.01]; the frequency domain desired signal can then be obtained by performing a discrete Fourier transform on the time domain desired signal. The present disclosure may assume that the background noise $S(k)$ in the frequency domain desired signal is uncorrelated white noise with a mean of 0 and a variance of $10^{-3}$.
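The test signals of this example could be generated roughly as in the sketch below. Several details are assumptions rather than part of the disclosure: the number of samples and the random seed are arbitrary, the four coloring-filter coefficients follow one reading of the text, and the background noise is added in the time domain for simplicity instead of in the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumed seed, for reproducibility only
num_samples = 4000               # assumed signal length

# FIR filter that colors the white noise (assumed reading of the coefficients).
coloring_fir = np.array([0.1, 0.2, -0.4, 0.7])

# 16-coefficient FIR filter acting as the unknown target system.
echo_path_fir = np.array([0.01, 0.02, -0.04, -0.08, 0.15, -0.3, 0.45, 0.6,
                          0.6, 0.45, -0.3, 0.15, -0.08, -0.04, 0.02, -0.01])

# Time domain reference signal: zero-mean white noise through the coloring filter.
white = rng.standard_normal(num_samples)
x = np.convolve(white, coloring_fir)[:num_samples]

# Background noise with mean 0 and variance 1e-3 (added in the time domain here).
noise = np.sqrt(1e-3) * rng.standard_normal(num_samples)

# Time domain desired signal: reference filtered by the target system, plus noise.
y = np.convolve(x, echo_path_fir)[:num_samples] + noise
```

Because the adaptive filter order N = 10 is smaller than the 16 coefficients of the target system, this setup exercises the insufficient filter order case analyzed above.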
First, an initialization step is performed.
For example, the covariance matrix $P_b(0)$ of the frequency domain error corresponding to the b-th speech frequency domain block in the 0-th speech frame is set to $\varepsilon I$, where the value of $\varepsilon$ may be $10^{-1}$;
For another example, the frequency domain filter estimation parameter $\hat{W}_b(0)$ (uppercase W denotes the frequency domain) corresponding to the b-th speech frequency domain block in the 0-th speech frame is set to 0;
For another example, the value of the acoustic echo path uncertainty a is set to 1, and the number of blocks is set to 2 (i.e., the total number of speech frequency domain blocks contained in a speech frame is 2).
Secondly, for each voice frame in the iterative process, the following steps are respectively carried out:
Step 1, performing serial-parallel conversion on the received time domain reference signal x (n) of the nth sampling point, and buffering the time domain reference signal after serial-parallel conversion.
Step 2, performing a discrete Fourier transform (such as a fast Fourier transform) on the buffered time domain reference signal to transform it into the frequency domain, thereby obtaining the frequency domain reference signal of each speech frequency domain block in the latest speech frame. When the total number of speech frequency domain blocks contained in a speech frame is 2, the obtained frequency domain reference signals may be $X_0(k)$ and $X_1(k)$.
Step 3, calculating the frequency domain expected signal (i.e., the estimated expected signal in the frequency domain) $\hat{Y}(k)$ of the latest speech frame by using formula (15).
Specifically, formula (15) can be realized by multiple delay processing, the point-wise multiplication of $X_0(k)$ and $\hat{W}_0(k)$, the point-wise multiplication of $X_1(k)$ and $\hat{W}_1(k)$, and so on up to the point-wise multiplication of $X_{B-1}(k)$ and $\hat{W}_{B-1}(k)$, followed by the addition of the point-wise multiplication results. Note that, when there is signal overlap between two adjacent speech frames, relations such as $X_1(k) = X_0(k-1)$ and $X_{B-1}(k) = X_0(k-B+1)$ hold, as shown in Fig. 4.
Step 4, performing an inverse discrete Fourier transform on the frequency domain expected signal $\hat{Y}(k)$ of the latest speech frame to obtain the time domain expected signal (i.e., the estimated expected signal in the time domain) $\hat{y}(k)$ of the latest speech frame; after parallel-serial conversion, the time domain expected signal forms a serial time domain expected signal.
Step 5, subtracting the estimated expected signal in the time domain from the actual expected signal in the time domain to obtain the time domain error signal $e(k)$ of the latest speech frame, zero-padding the time domain error signal $e(k)$, and performing a discrete Fourier transform on the zero-padded time domain error signal, thereby obtaining the frequency domain error signal $E(k)$ of the latest speech frame.
Step 5 above is equivalent to performing serial-parallel conversion on the time domain expected signal y(n) and buffering it, performing a discrete Fourier transform (e.g., a fast Fourier transform) on the buffered time domain expected signal of the latest speech frame (i.e., the current speech frame, described below as the k-th speech frame) to transform it into the frequency domain, obtaining the frequency domain expected signal (i.e., the actual expected signal in the frequency domain) $Y(k)$, and subtracting the estimated expected signal in the frequency domain from it, thereby obtaining the frequency domain error signal $E(k)$ of the latest speech frame.
Step 6, updating the filter coefficients by using formula (12) or formula (13); the updated $\hat{W}_b(k+1)$ can be used, in the next iteration, as the frequency domain filter estimation parameter corresponding to each speech frequency domain block in the latest speech frame.
Step 7, updating the frequency domain filter step size matrix $\mu_b(k+1)$ by using formula (10) to obtain the frequency domain filter step size matrix corresponding to each speech frequency domain block in the latest speech frame in the next iteration, and updating the covariance matrix $P_b(k+1)$ of the frequency domain error by using formula (11) to obtain the covariance matrix of the frequency domain error corresponding to each speech frequency domain block in the latest speech frame in the next iteration. In addition, the present disclosure may also update the Kalman gain using the following formula (24):
In formula (24) above, $K_b(k+1)$ represents the Kalman gain corresponding to the b-th speech frequency domain block in the (k+1)-th speech frame; $\mu_b(k)$ represents the frequency domain filter step size matrix corresponding to the b-th speech frequency domain block in the k-th speech frame; $X_b^H(k)$ represents the conjugate transpose of $X_b(k)$; and $X_b(k)$ represents the frequency domain reference signal of the b-th speech frequency domain block in the k-th speech frame.
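Steps 1 to 5 above (buffering, transform, multiple delay processing, point-wise multiplication, and error computation) can be sketched for one frame as follows. This is a hedged illustration, not the disclosure's implementation: the class and method names are hypothetical, the frame length is assumed to be 2L with a hop of L samples (50% overlap), the zero padding is assumed to be placed in front of the time domain error, and the coefficient update of steps 6 and 7 is deliberately left out.

```python
from collections import deque

import numpy as np


class MultiDelayFrontEnd:
    """Per-frame processing for steps 1 to 5 (filter update not included)."""

    def __init__(self, num_blocks: int, block_len: int):
        self.L = block_len
        self.x_buf = np.zeros(2 * block_len)  # last 2L reference samples
        # Delay line of spectra: entry b holds X_b(k) = X_0(k - b).
        self.spectra = deque(
            [np.zeros(2 * block_len, dtype=complex) for _ in range(num_blocks)],
            maxlen=num_blocks,
        )

    def process(self, x_new, y_new, W):
        """x_new, y_new: L new samples; W: filter estimates, shape (B, 2L)."""
        L = self.L
        # Steps 1-2: buffer the new reference samples and transform the frame.
        self.x_buf = np.concatenate([self.x_buf[L:], x_new])
        self.spectra.appendleft(np.fft.fft(self.x_buf))  # oldest spectrum drops out
        X = np.stack(list(self.spectra))
        # Step 3: point-wise multiplication per block, then addition.
        Y_acc = np.sum(X * W, axis=0)
        # Step 4: back to the time domain; only the last L samples are valid
        # (real-valued signals assumed, so the imaginary part is discarded).
        y_hat = np.fft.ifft(Y_acc).real[L:]
        # Step 5: time domain error, zero padding, transform to frequency domain.
        e = y_new - y_hat
        E = np.fft.fft(np.concatenate([np.zeros(L), e]))
        return X, y_hat, E
```

Keeping the spectra in a delay line means that only one forward transform of the reference signal is needed per frame, since the spectrum of block b is simply the block-0 spectrum from b frames earlier.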
Typically, the optimal solution of the frequency domain filter coefficients should satisfy the Wiener-Hopf equation, i.e., the optimal solution should be consistent with the Wiener solution. When the first time domain constraint matrix is added to the block frequency domain Kalman filtering algorithm, the influence of the non-causal part is eliminated during the iterative updating of the algorithm, so that the equivalent time domain filter coefficients of the improved block frequency domain Kalman filtering algorithm provided by the present disclosure are consistent with the Wiener solution, and the filtering algorithm can converge to the optimal solution faster.
Exemplary apparatus
Fig. 5 is a schematic structural diagram of an embodiment of a speech filtering apparatus of the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in FIG. 5, the apparatus of this embodiment includes an acquisition module 500, a determination parameter module 501, and a determination error module 502.
The obtaining module 500 is configured to obtain frequency domain filtering parameters corresponding to each of the frequency domain blocks of the previous speech frame adjacent to the current speech frame.
Optionally, the obtaining module 500 may obtain a frequency domain filter step size matrix corresponding to each of the voice frequency domain blocks in the previous voice frame adjacent to the current voice frame, a frequency domain filter estimation parameter corresponding to each of the voice frequency domain blocks in the previous voice frame, a frequency domain reference signal of each of the voice frequency domain blocks in the previous voice frame, and a frequency domain error signal of the previous voice frame. For example, for the ith speech frequency domain block in the previous speech frame, the obtaining module 500 may determine the frequency domain filter step size matrix corresponding to the ith speech frequency domain block in the previous speech frame according to the covariance matrix of the frequency domain error corresponding to the ith speech frequency domain block in the previous speech frame, the frequency domain reference signal of the ith speech frequency domain block in the previous speech frame, the autocorrelation matrix of the background noise of the previous speech frame, the frame length, and the block length.
Optionally, for the i-th speech frequency domain block in the previous speech frame, the obtaining module 500 may determine the covariance matrix of the frequency domain error corresponding to that block according to an earlier covariance matrix of the frequency domain error corresponding to that block, the frequency domain reference signal of that block, the process noise of that block, the frequency domain filter step size matrix corresponding to that block, an identity matrix, the frame length, and the block length.
The determining parameter module 501 is configured to determine, according to the frequency domain filtering parameters and two first time domain constraint matrices corresponding to each of the frequency domain blocks of the previous speech frame obtained by the obtaining module 500, the frequency domain filter estimation parameters corresponding to each of the frequency domain blocks of the current speech frame.
Optionally, for the ith voice frequency domain block of the current voice frame, the determining parameter module 501 may perform matrix multiplication on a frequency domain filter step size matrix corresponding to the ith voice frequency domain block of the previous voice frame, a frequency domain reference signal of the previous voice frame, a frequency domain error signal of the previous voice frame, and two first time domain constraint matrices, then the determining parameter module 501 adds a result obtained by matrix multiplication to a frequency domain filter estimation parameter corresponding to the ith voice frequency domain block of the previous voice frame, and then the determining parameter module 501 determines a frequency domain filter estimation parameter corresponding to the ith voice frequency domain block in the current voice frame according to the added result and a preset echo path uncertainty.
Optionally, the determining parameter module 501 may multiply the discrete Fourier transform matrix of size frame length by frame length, the four-element matrix, and the inverse discrete Fourier transform matrix of size frame length by frame length, and determine the first time domain constraint matrix according to the resulting product. The upper right, lower right, and lower left elements of the four-element matrix are all zero matrices of size block length by block length, and the upper left element is an identity matrix of size block length by block length.
The determining error module 502 is configured to perform filtering processing on the current speech frame according to the frequency domain filter estimation parameters corresponding to each speech frequency domain block in the current speech frame determined by the determining parameter module 501, so as to obtain a frequency domain error signal of the current speech frame.
Optionally, the determining error module 502 may multiply the frequency domain reference signal of each voice frequency domain block in the current voice frame with the frequency domain filter estimation parameter corresponding to each voice frequency domain block in the current voice frame to obtain the multiplication result corresponding to each voice frequency domain block in the current voice frame, then, the determining error module 502 accumulates the multiplication result corresponding to each voice frequency domain block, multiplies the accumulated result with the second time domain constraint matrix to obtain the multiplication result corresponding to the current voice frame, and then, the determining error module 502 determines the frequency domain error signal of the current voice frame according to the difference between the frequency domain expected signal of the current voice frame and the multiplication result corresponding to the current voice frame.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 6. Fig. 6 shows a block diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 6, the electronic device 61 includes one or more processors 611 and memory 612.
The processor 611 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 61 to perform the desired functions.
Memory 612 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), among others. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 611 to implement the speech filtering method and/or other desired functions of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 61 may also include an input device 613, an output device 614, and the like, interconnected by a bus system and/or other form of connection mechanism (not shown). In addition, the input device 613 may include, for example, a keyboard, a mouse, and the like. The output device 614 can output various information to the outside. The output devices 614 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, only some of the components of the electronic device 61 that are relevant to the present disclosure are shown in fig. 6 for simplicity, components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 61 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in a speech filtering method according to various embodiments of the present disclosure described in the "exemplary methods" section of the present description.
Program code for performing the operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a speech filtering method according to various embodiments of the present disclosure described in the above "exemplary method" section of the present disclosure.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of a readable storage medium may include an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, the embodiments are described in a progressive manner, and the description of each embodiment focuses on its differences from the other embodiments, so that the same or similar parts of the embodiments may be referred to each other. The description of the system embodiments is relatively brief because they essentially correspond to the method embodiments; for relevant points, reference should be made to the description of the method embodiments.
The block diagrams of the devices, apparatuses, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.